Expand the universe from the seed user list in the universe file. Should only be called once every 15 minutes.
This command also functions as an update of a universe.
The seed is currently limited to the first 14 users in the file. The API rate limit is 15 calls per 15 minutes, and one of those calls is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15 minutes.
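The limit of 14 follows from the rate-limit arithmetic: 15 calls per window, minus one call for the seed user ID lookup. A minimal sketch of that budget is below; `seed_budget` and `limit_seed` are hypothetical names used for illustration and are not taken from the project.

```python
CALLS_PER_WINDOW = 15   # stated API limit: 15 calls per 15 minutes
LOOKUP_CALLS = 1        # the initial request for seed user IDs


def seed_budget(calls_per_window=CALLS_PER_WINDOW, lookup_calls=LOOKUP_CALLS):
    """Return how many seed users fit in one rate-limit window."""
    return max(calls_per_window - lookup_calls, 0)


def limit_seed(screen_names):
    """Keep only as many seed users as the window allows, in file order."""
    return list(screen_names)[:seed_budget()]
```

With the defaults, `seed_budget()` is 14, so a file with 20 seed users would be cut to its first 14.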
Create and map the universe.
Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users by API limitation.
Delete the content specified by url.
Delete tweets specified by url.
Take all unprocessed tweets in given universe, extract and process their contents. When there are no tweets, sleep until it sees a new one.
A shim to help support using Newspaper’s canonical_link.
If none of these work or newspaper guesses a short-url domain, it gives up and requests the url to get its final redirect.
Retrieve description from opengraph, twitter, or meta tags.
Retrieve opengraph image for the resource.
Returns a prettified domain for the resource, stripping www.
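As an illustration of the domain prettifying described above, here is a small sketch using only the standard library. The function name `pretty_domain` is hypothetical, and the project's real method may handle more cases.

```python
from urllib.parse import urlparse


def pretty_domain(url):
    """Return the lowercased host of `url` without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

Only a leading `www.` is stripped; other subdomains such as `news.` are kept.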
Retrieve a published date. This almost never gets anything.
Retrieve title from opengraph, twitter, or meta tags.
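The "opengraph, twitter, or meta tags" lookups above can be read as a fallback chain. The sketch below assumes the tags have already been collected into a plain dict and that the preference order is opengraph, then twitter, then the plain meta tag; the helper names and the exact tag keys are assumptions, not the project's API.

```python
def first_present(meta, keys):
    """Return the first non-empty value found in `meta` for the given keys."""
    for key in keys:
        value = (meta.get(key) or "").strip()
        if value:
            return value
    return None


def get_title(meta):
    # Assumed preference order: opengraph, then twitter, then plain meta.
    return first_present(meta, ["og:title", "twitter:title", "title"])
```

A description lookup would follow the same pattern with `og:description`, `twitter:description`, and `description`.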
Get the top article image.
Retrieve twitter username of the creator.
Retrieve twitter image for the resource from twitter:image.src or twitter.image.
Retrieve default player for twitter cards.
Class to fetch article from a URL.
A shim to help support using Newspaper’s canonical_link.
If none of these work or newspaper guesses a short-url domain, it gives up and requests the url to get its final redirect.
Retrieve description from opengraph, twitter, or meta tags.
Retrieve opengraph image for the resource.
Retrieve favicon url from article tags or from http://g.etfv.co
Returns a prettified domain for the resource, stripping www.
Retrieve a published date. This almost never gets anything.
Get the article text.
Retrieve twitter username of the creator.
Retrieve twitter image for the resource from twitter:image.src or twitter.image.
Retrieve default player for twitter cards.
Smartly fetches metadata from a newspaper article, and cleans the results.
Retrieve an author or authors. This works very sporadically.
If none of these work or newspaper guesses a short-url domain, it gives up and requests the url to get its final redirect.
Retrieve opengraph image for the resource.
Retrieve favicon url from article tags or from http://g.etfv.co
Returns a prettified domain for the resource, stripping www.
Retrieve a published date. This almost never gets anything.
Retrieve a comma-separated list of all keywords, categories, and tags, flattened:
- opengraph tags + sections
- keywords + meta_keywords
- tags
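The flattening of the sources listed above could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and dropping blanks and case-insensitive duplicates is a choice made here, not something the source states.

```python
def flatten_keywords(*groups):
    """Merge several keyword lists into one comma-separated string.

    Blank entries are dropped and case-insensitive duplicates are removed,
    keeping the first spelling seen (both are assumptions of this sketch).
    """
    seen = set()
    merged = []
    for group in groups:
        for word in group or []:
            word = word.strip()
            if word and word.lower() not in seen:
                seen.add(word.lower())
                merged.append(word)
    return ",".join(merged)
```

Each argument stands for one source, for example the opengraph tags, the keywords, and the article tags.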
Get the article text.
Retrieve title from opengraph, twitter, or meta tags.
Retrieve twitter username of the creator.
Retrieve twitter image for the resource from twitter:image.src or twitter.image.
Retrieve default player for twitter cards.
Extract metadata from a URL, and return a dict result.
Uses newspaper https://github.com/codelucas/newspaper/, but overrides some defaults in favor of opengraph and twitter elements.
Parameters: html – if provided, skip downloading and go straight to parsing html.
Apply an offset in minutes to a given date.
Convert a date string to a python datetime (UTC).
Converts unix timestamp to python datetime (UTC).
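The three date helpers above might be sketched as follows with the standard library. The function names are hypothetical, and the string format shown is the Twitter-style datestring mentioned elsewhere in this reference; the formats the real helper accepts are not stated.

```python
from datetime import datetime, timedelta, timezone

# Assumed input format, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def apply_offset(date, minutes):
    """Shift a datetime by a (possibly negative) number of minutes."""
    return date + timedelta(minutes=minutes)


def parse_twitter_date(text):
    """Parse a Twitter-style datestring into a timezone-aware UTC datetime."""
    return datetime.strptime(text, TWITTER_FORMAT).astimezone(timezone.utc)


def from_unix(timestamp):
    """Convert a unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```

Returning timezone-aware datetimes avoids ambiguity when values from different sources are compared.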
Gets the start and end dates for a query.
Parameters: start – datetime to start with. If not present, defaults to the since argument.
Gets the amount of time that has elapsed between a given time and now. Smartly chooses between hours, days, minutes, and seconds. Accepts a datetime, unix epoch, or Twitter-formatted datestring.
Return the current time.
Parameters: stringify – return now as a formatted string.
Convert a datetime to an elasticsearch-formatted datestring (UTC).
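One plausible shape for this conversion is shown below. It assumes an ISO 8601 string with a trailing `Z`, which Elasticsearch's default date parsing accepts, and treats naive datetimes as already being in UTC. The name `to_es_date` and both assumptions are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def to_es_date(date):
    """Format a datetime as an ISO 8601 UTC string such as 2014-05-01T10:00:00Z."""
    if date.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        date = date.replace(tzinfo=timezone.utc)
    return date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A datetime two hours ahead of UTC is converted before formatting.
local_noon = datetime(2014, 5, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
formatted = to_es_date(local_noon)
```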
Takes an amount and a type. Returns a pluralized string.
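The elapsed-time and pluralizing helpers described above fit together naturally. The sketch below picks the largest unit that fits and only accepts datetimes; the real helper also accepts unix epochs and Twitter-formatted strings. Function names and the exact unit selection are assumptions.

```python
from datetime import datetime, timedelta, timezone


def pluralize(amount, unit):
    """Return e.g. '1 hour' or '3 hours'."""
    return f"{amount} {unit}" if amount == 1 else f"{amount} {unit}s"


def time_since(then, now=None):
    """Describe the time elapsed from `then` to `now` in the largest unit that fits."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - then).total_seconds())
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            return pluralize(seconds // size, unit)
    return pluralize(seconds, "second")


start = datetime(2020, 1, 1, tzinfo=timezone.utc)
label = time_since(start, start + timedelta(hours=5, minutes=3))
```

Passing `now` explicitly, as in the last two lines, keeps the result reproducible.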
Cache a set of results under a certain number of hours.
Index a new top link to the given universe.
Create and map the universe.
Delete everything in the universe that is more than the given number of days old. Does not apply to top content.
Delete the content specified by url.
Delete tweets specified by url.
Delete a user from the universe index by their id.
Save a tweet to the universe index as an unprocessed tweet document.
Return a new-style Elasticsearch client connection for the universe.
Helper function to return all values in a certain field. Defaults to retrieving all ids from a given index and doc type.
Get a resolved URL from the index. Returns None if URL doesn’t exist.
The default function: gets the most popular links shared from a given universe and time frame.
Get the most recently added top links in the given universe.
Get extended stats on the scores returned from the results cache.
Parameters: hours – type of query to search for.
Search for any links in the current set that are a high enough score to get into top links. Return one (and only one) if so.
Get a list of all providers (i.e. domains) in order of popularity. Possible future use for autocomplete, to search across publications.
Get tweets in a given universe.
Get top users for the universe by weight.
Parameters: size – number of users to get. Defaults to all users.
Takes a list of user ids and returns a dict with their weighted influence.
Get the next unprocessed tweet and delete it from the index.
Save the content of a URL to the index.
Save a tweet to the universe index, fully processed.
Check if a user exists in the database. If not, create it. If so, update it.
Scores a given link returned from elasticsearch.
Search fulltext of all content across universes for a given string, or a custom match query.
Search the text of both tweets and content for a given term and universe, and return some items matching one or the other.
Index a URL and its resolution in Elasticsearch.
An attempt to make the Elasticsearch client a bit more usable. Currently implements search, get, and mget. get_source maps to get because there really should not be a need for both if the API is done correctly. For the time being, other client methods will behave just as they do for the client provided by Elasticsearch.
The implemented methods return ESDocument or ESCollection objects. ESDocument is dot-addressable and dict-like. Its keys are your document keys. Metadata is attached to this document with underscored properties: _index, _type, _id, and when applicable: _score, _version, _found. I realize that _ has the general connotation of “privacy” in Python, but this seems to me to be the safest way to have special keys and keep the API fairly usable.
ESCollection is an iterable of those documents and has the special properties: took, timed_out, total_shards, successful_shards, failed_shards. No need for ['hits']['hits'] dereferencing: just iterate the collection. Same for ['docs'] on an mget operation.
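To make the description above concrete, here is a rough sketch of a dot-addressable document and an iterable collection. The classes `Document` and `Collection` are hypothetical stand-ins written for this illustration; they are not the project's ESDocument and ESCollection, and they cover only a subset of the properties listed.

```python
class Document(dict):
    """A dict whose keys can also be read as attributes."""

    def __init__(self, hit):
        # Document fields come from the hit's _source.
        super().__init__(hit.get("_source", {}))
        # Metadata is attached as underscored attributes, not as dict keys.
        for key in ("_index", "_type", "_id", "_score", "_version", "_found"):
            if key in hit:
                object.__setattr__(self, key, hit[key])

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


class Collection(list):
    """A list of documents carrying response-level properties."""

    def __init__(self, response):
        super().__init__(Document(hit) for hit in response["hits"]["hits"])
        self.took = response.get("took")
        self.timed_out = response.get("timed_out")


response = {
    "took": 3,
    "timed_out": False,
    "hits": {"hits": [{"_index": "u", "_type": "tweet", "_id": "1",
                       "_score": 1.5, "_source": {"text": "hi"}}]},
}
docs = Collection(response)
texts = [doc.text for doc in docs]   # iterate directly, no ['hits']['hits']
```

Keeping metadata out of the dict keys means `dict(doc)` yields only the stored document fields.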
v defaults to None.
If key is not found, d is returned if given, otherwise KeyError is raised
2-tuple; but raise KeyError if D is empty.
If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
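These fragments match the docstrings of Python's built-in dict methods setdefault, pop, popitem, and update, which a dict-like document would inherit; that reading is an inference, since the method names are not shown here. The behavior they describe can be seen with a plain dict:

```python
doc = {"title": "Example"}

# setdefault: the value defaults to None when it is omitted.
author = doc.setdefault("author")

# pop: a missing key returns the default if one is given;
# without a default it raises KeyError.
missing = doc.pop("missing", "fallback")

# update: takes a mapping (or an iterable of pairs), then keyword arguments.
doc.update({"title": "Updated"}, score=2)

# popitem: removes and returns a (key, value) 2-tuple;
# it raises KeyError when the dict is empty.
last = doc.popitem()
```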
Aborts a running benchmark. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html
Parameters: name – A benchmark name
The benchmark API provides a standard mechanism for submitting queries and measuring their performance relative to one another. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html
Perform many index/delete operations in a single API call. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
See the bulk() helper function for a more friendly API.
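For reference, the bulk endpoint takes a newline-delimited JSON body: one action line, followed by a source line for every action except delete. The helper below is a sketch of building such a body by hand; `bulk_body` is a hypothetical name, and in practice the bulk() helper mentioned above is the friendlier route.

```python
import json


def bulk_body(actions):
    """Serialize (op, metadata, source) triples into a newline-delimited bulk body."""
    lines = []
    for op, meta, source in actions:
        lines.append(json.dumps({op: meta}))
        if op != "delete":          # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = bulk_body([
    ("index", {"_index": "universe", "_id": "1"}, {"text": "hi"}),
    ("delete", {"_index": "universe", "_id": "2"}, None),
])
```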
Clear the scroll request created by specifying the scroll parameter to search. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
Execute a query and get the number of matches for that query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-count.html
The percolator allows registering queries against an index and then sending percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Adds a typed JSON document in a specific index, making it searchable. Behind the scenes this method calls index(..., op_type='create'). http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
Delete a typed JSON document from a specific index based on its id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete.html
Delete documents from one or more indices and one or more types based on a query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
Remove a stored script from elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Delete a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html
Parameters: id – Template ID
Returns a boolean indicating whether or not given document exists in Elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html
The explain API computes a score explanation for a query and a specific document. This can give useful feedback on whether a document matches or does not match a specific query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-explain.html
Retrieve a script from the API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Retrieve a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html
Adds or updates a typed JSON document in a specific index, making it searchable. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
Get the basic info from the current cluster.
View the progress of long-running benchmarks. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html
Get documents that are “like” a specified document. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-more-like-this.html
The percolator allows registering queries against an index and then sending percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Execute several search requests within the same API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-multi-search.html
The multi termvectors API retrieves multiple termvectors based on an index, type, and id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/docs-multi-termvectors.html
The percolator allows registering queries against an index and then sending percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Returns True if the cluster is up, False otherwise.
Create a script in given language with specified ID. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Create a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html
Scroll a search request created by specifying the scroll parameter. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
The search shards api returns the indices and shards that a search request would be executed against. This can give useful feedback for working out issues or planning optimizations with routing and shard preferences. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-shards.html
A query that accepts a query template and a map of key/value pairs to fill in template parameters. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/query-dsl-template-query.html
The suggest feature suggests similar looking terms based on a provided text by using a suggester. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
Added in 1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-termvectors.html
Update a document based on a script or partial data provided. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html
v defaults to None.
If key is not found, d is returned if given, otherwise KeyError is raised
2-tuple; but raise KeyError if D is empty.
If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
Return the first of the article images.
Create a requests session optimized for many connections.
Take a raw tweet from the queue, extract and save metadata from its content, then save as a processed tweet.
Take all unprocessed tweets in given universe, extract and process their contents. When there are no tweets, sleep until it sees a new one.
Return a Twitter client for the given universe.
Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users by API limitation.
Get Twitter IDs for friends of the given user_id.
Lookup Twitter users by screen name. Limited to first 100 user names by API limitation.
Return a Twitter streaming client for the given universe.
Expand the universe from the seed user list in the universe file. Should only be called once every 15 minutes.
This command also functions as an update of a universe.
The seed is currently limited to the first 14 users in the file. The API rate limit is 15 calls per 15 minutes, and one of those calls is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15 minutes.