Metadata

Metadata are created and stored in GDBase - a relational database, part of Algolytics analytical platform.

Tables with metadata are created with following SQL code:

CREATE TABLE clients(id INTEGER PRIMARY KEY /*, ...*/)
CREATE TABLE external_data(id INTEGER PRIMARY KEY, client_id INTEGER /*, ...*/)
CREATE TABLE variables(id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, event_id INTEGER NOT NULL, definition TEXT NOT NULL, type INTEGER NOT NULL CHECK(type = 1 OR type = 2), default_value TEXT)
CREATE INDEX ON variables(client_id)
CREATE TABLE aggregates(id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, variable_id INTEGER, aggregate_type INTEGER, window_type INTEGER, window_size/*count|time*/ INTEGER, window_shift/*count|time*/ INTEGER, definition TEXT, return_type INTEGER CHECK((definition IS NULL AND return_type IS NULL) OR (definition IS NOT NULL AND return_type IS NOT NULL AND (return_type = 1 OR return_type = 2))), dictionary_id INTEGER, external_data_id INTEGER, external_data_name TEXT, name TEXT NOT NULL CHECK(name REGEXP '[a-zA-Z_$][a-zA-Z_$0-9]*'))
CREATE UNIQUE INDEX ON aggregates(client_id, name)
CREATE TABLE derived_aggregates(derived_aggregate_id INTEGER NOT NULL, aggregate_id INTEGER NOT NULL)
CREATE INDEX ON derived_aggregates(derived_aggregate_id)
CREATE TABLE trigger_aggregates(trigger_id INTEGER NOT NULL, aggregate_id INTEGER NOT NULL)
CREATE INDEX ON trigger_aggregates(trigger_id, aggregate_id)
CREATE TABLE triggers(id INTEGER PRIMARY KEY, definition TEXT NOT NULL, needs_change INTEGER/*boolean*/ NOT NULL CHECK(needs_change = 0 OR needs_change = 1), `group` INTEGER NOT NULL)
CREATE TABLE model_triggers(model_id INTEGER NOT NULL, trigger_id INTEGER NOT NULL)
CREATE INDEX ON model_triggers(model_id, trigger_id)
CREATE TABLE model_aggregates(model_id INTEGER NOT NULL, aggregate_id INTEGER NOT NULL, used INTEGER/*boolean*/ NOT NULL CHECK(used = 0 OR used = 1))
CREATE INDEX ON model_aggregates(model_id, aggregate_id)
CREATE TABLE models(id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, used INTEGER/*boolean*/ NOT NULL CHECK(used = 0 OR used = 1), target_aggregate_id INTEGER NOT NULL, target_start INTEGER NOT NULL CHECK(target_start >= 0), target_length INTEGER NOT NULL CHECK(target_length >= 0), target_length_exact INTEGER/*boolean*/ NOT NULL CHECK(target_length_exact = 0 OR target_length_exact = 1), scoring_code TEXT CHECK(NOT used OR scoring_code IS NOT NULL), saved_state_time INTEGER CHECK(NOT used OR saved_state_time IS NOT NULL))
CREATE INDEX ON models(client_id)
CREATE TABLE dictionary(id INTEGER, categorical INTEGER/*boolean*/ NOT NULL CHECK(categorical = 0 OR categorical = 1), value TEXT CHECK(categorical OR value IS NULL), start FLOAT CHECK(NOT categorical OR start IS NULL), start_inclusive FLOAT CHECK((start IS NULL AND start_inclusive IS NULL) OR (start IS NOT NULL AND start_inclusive IS NOT NULL AND (start_inclusive = 0 OR start_inclusive = 1))), end FLOAT CHECK((NOT categorical OR end IS NULL) AND (start IS NULL OR end IS NULL OR end >= start)), end_inclusive FLOAT CHECK((end IS NULL AND end_inclusive IS NULL) OR (end IS NOT NULL AND end_inclusive IS NOT NULL AND (end_inclusive = 0 OR end_inclusive = 1))), mapped_value TEXT NOT NULL)
CREATE INDEX ON dictionary(id)

Tables structure

clients
- id - id, a primary key
- authorization data
- other data about a client
external data
- id - id, a primary key
- client_id
- data pointing at external source of data
variables (stores definitions of variables obtained from JSON)
- id INTEGER PRIMARY KEY
- event_id INTEGER - corresponds with the eventId field in JSON
- definition_type - a type of definition: JSON_PATH or TRANSFORMED
- definition TEXT - definition of changing event into variable (JSONPath)
  - if definition_type = JSON_PATH then definition contains an expression JsonPath. It should returns numerical or text value. Syntax of expressions is described in JsonPath_README.md. For efficiency reasons it is the best to use $.x.y.z expressions types. Moreover, optimalization for [?(@.a == 'x' && @.b == 3 && @.c == @.d && ...)] filters is added
  - if definition_type = TRANSFORMED then definition contains full definition of java class, which must inherit from DoubleTransformation or StringTransformation (without the import of this class). In defined class should be no specified package
- input_id - for TRANSFORMED variable indicates id of transferred to the class, definied in definition_type input variable. Id has to be JSON_PATH variable
- category - if not null, definition should return a list, if this list contains category, variable will be created
- type (numerical, categorical) - type of the variable definied in DBConstants (VARIABLE_...)
- default_value - default value of the variable, used if the definition returns null value
aggregates
- id - id, a primary key
- variable_id - indicates a variables table. If the aggregate is not created from a variables table, the value is null
- aggregate_type - type of the aggregate defined in DBConstants (AGGREGATE_...) (see Aggregate types section)
- window_type - type of the window defined in DBConstants (WINDOW_...) (see Window types section). Each variable used by target must have window type set to WINDOW_TARGET
- window_size (count/time) - window size, defined as number of occurrences or time in miliseconds
- window_shift (count/time) - window shift (for windows with shift), defined as number of occurrences or time in miliseconds
- definition - a Java expression defining aggregate as a derivative of other aggregates. Every aggregate used in the definition must be listed in derived_aggregates table. If the aggregate is not a derivative aggregate, the value is null
- return_type - type of aggregate defined by definition column (DBConstants.VARIABLE_...)
- dictionary_id - a dictionary from dictionary table. If null then the aggregate do not use a dictionary
- external_data_id - id of external data from external data table
- external_data_name - name of variable in external data
- name - variable name, used in Java expressions, unique for client_id
derived_aggregates
- derived_aggregate_id - id of derived aggregate from aggregates table
- aggregate_id - ids of aggregates used in creating the derived aggregate (may be multiple ids for given derived_aggregate_id)
triggers (definition of scoring moment, i.e. definition of a moment of creating a training line with target)
- id - id, a primary key
- definition - definition of given trigger variable, which is an expression in Java using input variables to calculate value of trigger variable. The expression returns true or false, it may be also a definition of user's affiliance to a model. Input variables from the Java expression must be listed in the trigger_aggregates table
- needs_change - scoring occurs only, if during the last launch of trigger the definition returned false
- group - a group to which the trigger belongs. If the model defines triggers from different groups, and each group contain multiple triggers, scoring occurs only, if at least 1 trigger in every group returns, that scoring must occur
models
- id INTEGER PRIMARY KEY
- active - if false, model is inactive, cannot be used for modeling or scoring. If there is new model added, but it is without code yet, active shuold have true value
- used - if false, the model is not used for scoring (it does not affect counting tables for modeling).
- target_aggregate_id - indicates target, used while building a model
- target_type:
  - if has GENERAL value, target_start, target\length, target_length_exact (described below) will be included
  - if has CLICK value:
    trigger_validator (described below) will be included
    at the offline start, giving negative and positive target value will be needed
    row of last positive trigger before positive target or last target in data if there is no positive target value is recorded to a table
    every annoucement after trigger is included in counting aggregates of target and validator. For every new trigger (if there was no positive target before) aggregates are counted again (aggregates counted before are forgotten)
  - if has CLICK_AD value, usage is special for clickad, inter alia:
    at the offline start, giving negative and positive target value is needed
    every annoucement, that responds to the display of advertisement to a user (for each user many rows can be shown) is recorded to a table
    fitting target and trigger is based on specific fields in JSON. For trigger, despite userId is null, target will be found basing on different fields
- target_start - time in miliseconds. The target window starts after this time
- target_length - length of the target window in miliseconds
- target_length_exact - negative value means, that target window is counted till the end of the data (cannot be smaller than target_length)
- trigger_validator - expression in Java, which, after trigger, has to return true. Thereby row can be recorded to a learning table. If it has null value, there is no additional restriction on recording rows. validator can use aggregates with any type of window (WINDOW_TARGET and WINDOW_GLOBAL are treated the same)
- saved_state_time - to what time the state of users was calculated after building a model. Value is set by application
- value - value connected with model, for example CPC (cost per click) in RTB case. If in response for request there is score returned (not value), then value should be 1
- client_value - value connected with model, for example CPC of a client in RTB case, if not needed then null
- positive_target_cnt - in RTB case: ordered targets' quantity, which should be generated in a campaign (ex. clicks); if not needed, then null
- weight - weight; in RTB case bidding price is value * score * weight
- positive_target_ratio - ratio of number of rows with positive target to the total number of rowsin table used for modeling. Until it is not counted, set value to 0
- category_id - campaign category
- use_category - 1 - model building per campaign category; 0 - model building per campaign
- end_date - time in miliseconds to the end of campaign (in RTB case: time to the end of which ordered number of positive targets (positive_target_cnt) should be made
model_aggregates (defining models' arguments, every model can use various aggregates, and every aggregate can be used in various models)
- model_id - id of the model from models table
- aggregate_id - id of the variable from aggregates table (may by many for given model_id)
- used - if positive, the variable is passed to scoring code, if negative, the variable is used in model building
trigger_aggregates (defining triggers' arguments, every trigger can use various aggregates, and every aggregate can be used in various triggers)
- trigger_id - id of the trigger from trigger table
- aggregate_id - id of the variable from aggregates table (may by many for given trigger_id)
model_triggers (defining models' triggers, every model can use various triggers, and every trigger can be used in various models)
- model_id - id of the model from models table
- trigger_id - id of the trigger from triggers table (may by many for given model_id)
dictionary (a dictionary mapping values from JSON to variables)
- id - id of group or interval of values
- categorical - if true, then value variable from this table is used as an input. If false, then start, start_inclusive, end, end_inclusive are used as an input
- value - a specific value from JSON
- start - start of an interval
- start_inclusive - defines, whether the start of the interval is open or closed
- end - end of an interval
- end_inclusive - defines, whether the end of the interval is open or closed
- mapped_value - value passed to an aggregate
default_model (default parameters used when adding new models from api level automatically)
- client_id - id of a client
- target_definition - definition of derivative aggregate being target (at uploading and creating new model string ${model_id} will be modified to model_id of new model)
- target_aggregates - aggregates needed to count the target (names of aggregates separated by comas, ex.:'agg1,agg2,agg3')
- trigger_definition - definition of trigger (at uploading and creating new model string ${model_id} will be modified to model_id of new model)
- trigger_aggregates - aggregates needed to count the trigger (names of aggregates separated by comas, ex.:'agg1,agg2,agg3')
- trigger_validator_definition - definition of trigger validator (at uploading and creating new model string ${model_id} will be modified to model_id of new model)
- trigger_validator_aggregates - aggregates needed to count the trigger validator (names of aggregates separated by comas, ex.:'agg1,agg2,agg3')
- target_type - corresponds to target_type variable in models table
- target_start - corresponds to target_start variable in models table
- target_length - corresponds to target_length variable in models table
- target_length_exact - corresponds to target_length_exact variable in models table
- target_aggregate_name - name of the aggregate being a target (at uploading and creating new model string ${model_id} will be modified to model_id of new model)
- weight - corresponds to weight variable in models table
model_urls (url adresses that should be included/excluded when building a model)
- model_id - id of a model
- client_id - id of a client
- url - url adress of www website (domena)
- included - if 1, website should be included in modeling, if 0 website should be excluded

Window types

public final static int WINDOW_TARGET       = 0;
public final static int WINDOW_GLOBAL       = 1;
public final static int WINDOW_TIME         = 2;
public final static int WINDOW_TIME_SLIDE   = 3;
public final static int WINDOW_COUNT        = 4;
public final static int WINDOW_COUNT_SLIDE  = 5;
public final static int WINDOW_CURRENT_TIME = 6;

Windows definition:

WINDOW_TARGET – a specific window type to define a target variable for the predictive model training process
WINDOW_GLOBAL – The window includes all the data history saved in the tool
WINDOW_TIME – The window aggregates the data in a window specified by time (given in ms). The length of the window is given in window_size. The window is of the "tumbling window" type
WINDOW_TIME_SLIDE - The window aggregates data in a window defined by time (given in ms) and offset by the time specified in the parameter window_shift (in ms). The length of the window is given in window_size. Sliding window
WINDOW_COUNT – window defined as an aggregate from window_size events
WINDOW_COUNT_SLIDE - window defined as an aggregate of window_size events moved back by window_shift events. Sliding window
WINDOW_CURRENT_TIME – window length window_size aggregated in real time

For the above window units, the window_lag parameter allows the window to be moved away by window_lag from the current moment.

Tumbling window - events are summarized in fixed time-fixed windows. The value changes when the window is closed

Sliding window - Events are summarized in fixed time-fixed windows but in this case the windows overlap each other, thanks to which we get more frequent updates of the value than for the "tumbling window"

Real time window – aggregate values are calculated in real time

Please note that real-time window calculation is computationally expensive, so avoid using it where it is not necessary (only if the application requires a real-time aggregate).

Aggregate types

public final static int AGGREGATE_COUNT     	= 1; 
public final static int AGGREGATE_SUM       	= 2;
public final static int AGGREGATE_LASTVALUE 	= 3; //last value based on the order defined by the timestamp field
public final static int AGGREGATE_EXISTS    	= 4; //counts number of events with non null value of a field
public final static int AGGREGATE_MIN       	= 5;
public final static int AGGREGATE_MAX       	= 6; 
public final static int AGGREGATE_CURRENT_VALUE = 7; //last value according to the time of event consumption by EventEngine

PreviousAggregate module NextComponents of metadata

Last updated 6 months ago