Event Engine [administrator]
Event engine – functional specification of the solution
Purpose of the solution
Efficient real-time event stream processing and on-line scoring based on analytical models using information from these event streams
The solution is scalable and configurable, which allows it to be used in various business domains (including gaming, recommendations, web analytics, and IoT – e.g. processing event streams from device sensors)
Scope of the solution
It is a complete system for processing event streams on-line (event aggregation and real-time scoring) and off-line (event aggregation to automatically build analytical models). The scope of the solution includes:
Ensure efficient handling of event streams from multiple clients at the same time
Writing raw (individual) events to a repository for off-line aggregation and creating analytical tables for modelling
Storing the state of off-line users
Aggregate counting – a module used for counting aggregates in off-line (building models) and on-line (on-line scoring), connected to the processing path in the selected environment
Automatic creation of analytical models (using ABM)
Automatic deployment of new models for use for on-line scoring (via metadata)
On-line scoring – checking the conditions that trigger scoring with individual models, triggering scoring and returning a response to the customer
Solution assumptions
The client application sends messages (events) in the form of jsons via an http connection (REST API). Events are fed into the event engine via the Kafka queue. Each event is written to the event repository to enable off-line processing.
Then:
In the on-line version:
The event is converted into variables (definition in metadata)
The user's aggregate values using these variables are refreshed
The scoring conditions for each model are checked (conditions that trigger scoring and conditions that check whether a given user should be scored with a given model)
For each model for which the conditions are met, a row of data is prepared for recalculation (based on the model description in the metadata)
Scoring is triggered
The response is returned to the client
In the off-line version (triggered every set period of time, automated process):
For each customer and model, aggregates are counted based on messages stored in a text file
For each user, 1 row can be created in the resulting analytical table, containing the counted aggregates and the value of the target variable. For some users the row will not be created, because:
The scoring condition will not be met
The conditions for calculating the target window will not be met (e.g. the target window will be exactly 3 days, and there are only 2 days of history in the data)
Note: in the case of programmatic, multiple rows can be created for each user, because input jsons (bid requests) can actually contain several bid requests for different impressions. Then as many rows are created for the user as there are impression ids.
A separate analytical table is created for each model
The analytical table is the input for building the models
Selected models are automatically deployed (scoring code and information about the variables used are saved to metadata)
Scheme of operation of the system
Messages
The client application sends messages (events) in the form of jsons over an http connection
Messages are sent over the http connection in packets (in particular, a packet may contain just 1 message)
Data encryption: SSL (can be disabled by setting it in the http server configuration file)
Messages can come from multiple sources (e.g., game servers, users)
Differentiation by client_id
The application is configurable for specific customers by defining dedicated metadata (variables, aggregates, models)
The order in which messages are processed, based on the time the event arrives in the system, is maintained
Event json format
Generic formats – for these we provide efficient processing
Variable example:
$.['eventType1'].['eventA'].['value']
returns 10
Variable example:
$.['eventType1'].['value']
returns "value1
"In addition, queries that allow the insertion of conditions that check for equality "
==
" and are combined with "&&
" are optimized:$.['eventType1'].['eventA'].[?(@.['value'] == 10 && @.['name'] == "AAA")].value
returns[10]
or:$.['eventType1'].[?(@.['eventA'].['value'] == 10 && @.['eventA'].['name'] == "AAA")].value
returns["value1"]
Filters using "==
" and "&&
" can be at different levels of the querycodeLists of values of the "category" type are also optimized: ["A1", "A2", "A3"] if we want to pull out a category variable that is a list, but ultimately we want to build aggregates that count the number of events with a given value in the list, e.g. A1_cnt_all, A2_cnt_all, A3_cnt_all
Arbitrary formats compliant with JSONPath (slower processing) – the analyst adds the rules for converting an event to a variable to the metadata (variables table, point Definition of target):
Example using a regular expression:
$.[?(@.name =~ /.*eve.*/i)].['type'].[1].['value']
returns [45]
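For illustration only, the short sketch below evaluates expressions like the generic-format examples above against a hypothetical event payload, using the open-source Jayway json-path library; the engine itself uses its own optimized JSONPath evaluator (syntax described in JsonPath_README.md), so this is not the engine's code, just a way to see what such expressions return.

```java
import com.jayway.jsonpath.JsonPath;

public class JsonPathVariableDemo {
    public static void main(String[] args) {
        // Hypothetical event payload shaped after the examples above (not a real client message).
        String event = "{\"eventType1\":{\"value\":\"value1\",\"eventA\":{\"value\":10,\"name\":\"AAA\"}}}";

        // Plain navigation expressions of the form $.x.y.z are the fastest for the engine to process;
        // the bracket notation below is the standard equivalent of the example expressions above.
        Integer v1 = JsonPath.read(event, "$['eventType1']['eventA']['value']");
        String  v2 = JsonPath.read(event, "$['eventType1']['value']");

        System.out.println(v1); // 10
        System.out.println(v2); // value1
    }
}
```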
Response to the customer's message:
Score is returned directly in the query response
The results include the following fields:
userId – The user's ID
scores – a list of models and scores; empty if no scoring has occurred
modelId – the model's ID
score – the value of the score for the model
Response to the client message in the case of programmatic:
Score is returned directly in response to the bid request
The results include a list of suggested bidding prices for each impression:
impid – impression id
scores – a list of models and scores. The scores object is empty if no scoring has occurred or there is no active deployed model. Otherwise, it contains elements where the key is the model id and the value is the suggested bidding price
The price is calculated according to the formula:
min(10 * value, score * value * weight)
score – the score value returned by the model
value – the value of the variable from the models table (CPC)
weight – weight from the models table (by default 1000, because the prices are bid in the CPM rate)
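As a minimal sketch, the bid price formula above maps directly to:

```java
// Suggested bid price formula described above.
// score  – the value returned by the model for the impression
// value  – CPC value from the models table
// weight – weight from the models table (1000 by default, scaling CPC to the CPM rate)
static double suggestedBidPrice(double score, double value, double weight) {
    return Math.min(10 * value, score * value * weight);
}
```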
Aggregate counting module
Counting aggregates in the on-line and off-line versions (in both versions, the aggregates are counted with the same code)
Off-line version
Processing is parallelized by user id
Aggregates are counted based on the raw events written to the event repository
The launch of the off-line version can be scheduled (offline scheduling, point Scheduling offline processing)
On-line version
Processing is parallelized by user id
On-line aggregates are counted and stored in memory (for on-line users)
After a period of user inactivity defined in the configuration file (no messages about a given user), the aggregates are saved in the MongoDB database (the user is logged out)
Types of aggregates
Incremental – counted over the whole data history
Counted in a window (defined by time or by a specific number of messages)
Sliding windows (defined by time or by a specific number of messages)
Target windows – used for target calculation, only in the off-line version
A single message can belong to multiple windows
The list of counted aggregates is defined in the metadata
Aggregate List
Number of occurrences, sum, last value, flag if the event occurred, min, max, current value (from the currently processed json)
Derivative aggregates defined in the form of expressions in Java (e.g. aggr1 + aggr2)
Aggregates resulting from defined dictionaries (described in the section on dictionaries)
How messages are stored
All messages are written to the message repository. If necessary (e.g. customer requirement), a backup can be created
Writing messages to the repository does not block further processing
Repository for storing messages: txt files
Structure and storage of metadata
Metadata is created and stored in the gdbase database
Creating tables:
Table structures:
clients (not used yet):
id INTEGER PRIMARY KEY
authorization data
other customer data, e.g. payments, access restrictions
external_data (not used yet):
id INTEGER PRIMARY KEY
client_id INTEGER
data pointing to an external data source
variables (stores definitions of variables obtained from json):
id INTEGER PRIMARY KEY
event_id INTEGER - corresponds to the eventId field passed in the json
definition_type – definition type: JSON_PATH or TRANSFORMED
definition TEXT - definition of converting an event to a variable. If definition_type = JSON_PATH, definition contains a JsonPath expression; it should return a numeric or text value. The syntax of the expressions is described in the JsonPath_README.md file. For performance reasons, it is best to use only expressions of the form $.x.y.z; in addition, filters of the form [?(@.a == 'x' && @.b == 3 && @.c == @.d && ...)] are optimized. If definition_type = TRANSFORMED, definition contains the full definition of a Java class that must inherit from DoubleTransformation or StringTransformation (there should be no import of this class and no package declaration in the defined class)
input_id - for a TRANSFORMED variable, the id of the input variable passed to the class defined in definition; this id must point to a JSON_PATH variable
category - if non-null, definition should return a list, and the variable will be created if category is present in this list
type (numerical, categorical) - variable type defined in DBConstants (VARIABLE_...)
default_value – the value of the variable used if definition returns null
aggregates (stores the aggregate definitions used by models and triggers):
id INTEGER PRIMARY KEY
variable_id INTEGER - points to the variables table; null if the aggregate is not produced from the variables table
aggregate_type INTEGER - type of aggregate defined in DBConstants (AGGREGATE_...); the list of possible types is hard-coded in Java (a dictionary table could perhaps be added)
window_type INTEGER - type of window defined in DBConstants (WINDOW_...); all aggregates used by the target must have WINDOW_TARGET set
window_size /*count|time*/ INTEGER - window size as a number of event occurrences or time in ms
window_shift /*count|time*/ INTEGER - for windows with an offset - number of event occurrences or time in ms
definition TEXT - a Java expression that defines an aggregate based on the values of other aggregates; the derived_aggregates table must list all aggregates used in the definition; null if the aggregate is not a derived aggregate
return_type - type of the aggregate defined by definition (DBConstants.VARIABLE_...)
dictionary_id - if null, the aggregate does not use a dictionary; if non-null, it points to rows from the dictionary table
external_data_id INTEGER - not used yet
external_data_name TEXT - not used yet - name of the variable in the external data source
name TEXT - the name of the variable. This name is used in Java expressions, including the definition column in this table (so it should not be a Java keyword); unique per client_id
derived_aggregates (defines arguments for derived aggregates):
derived_aggregate_id INTEGER – id of the aggregate from the aggregates table for which arguments are defined
aggregate_id INTEGER - id of the aggregate from the aggregates table that is an argument (there can be many for a given derived_aggregate_id)
triggers (i.e. the definition of the moment of scoring == the definition of the moment of creating a training row with the target; this table also holds the definitions of groups of users scored with the same model):
id INTEGER PRIMARY KEY
definition TEXT - definition of a given trigger variable, i.e. a Java expression that uses input aggregates to calculate the value of the trigger; the expression returns true or false and can also define the user's membership in the model. The aggregates used must be listed in the trigger_aggregates table
needs_change INTEGER /*boolean*/ - scoring will occur only if the definition expression returned false on the previous trigger run
group INTEGER - the group to which the trigger belongs. If the model defines triggers from several groups and there are several triggers in each group, then scoring occurs only if at least one trigger in each group indicates that scoring should be performed (see the sketch below)
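An illustrative sketch (not the engine's code) of the group rule above: scoring fires only when, for every group referenced by the model's triggers, at least one trigger in that group is true. The Trigger record and the pre-evaluated boolean values are hypothetical; in the engine the values come from evaluating the Java expressions stored in the triggers table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TriggerGroups {
    record Trigger(int id, int group, boolean fired) {}

    static boolean shouldScore(List<Trigger> modelTriggers) {
        // collect, per group, whether any trigger in that group fired
        Map<Integer, Boolean> groupFired = new HashMap<>();
        for (Trigger t : modelTriggers) {
            groupFired.merge(t.group(), t.fired(), Boolean::logicalOr);
        }
        // every group must contain at least one trigger that returned true
        return !groupFired.isEmpty() && groupFired.values().stream().allMatch(Boolean::booleanValue);
    }
}
```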
models:
id INTEGER PRIMARY KEY
active - if false, the model is inactive and cannot be used for modeling or scoring. If a new model is added but there is no scoring code yet, active should be true
used - if false, the model is not used for scoring (and it does not affect the calculation of the table to be modeled)
target_aggregate_id - indicates the target; used in the construction of the model
target_type:
if the value is GENERAL, then the target_start, target_length and target_length_exact described below are taken into account
if the value is CLICK, then:
the trigger_validator described below is taken into account
when starting offline processing, the positive and negative target values must be provided
a row is written to the table for the last positive trigger before the positive target occurred, or for the last one in the data if there was no positive target
to count the aggregates of the target and the validator, all messages after the trigger occurrence are taken into account; for each new occurrence of the trigger (if no positive target occurred before), the aggregates are counted anew (previously counted values are forgotten)
if the value is CLICK_AD, then the handling is specific to ClickAd, including:
when starting offline processing, the positive and negative target values must be provided
each message that corresponds to displaying the ad to the user is saved in the table (multiple rows may appear for one user)
a specific JSON format is assumed; among other things, it contains fields from com.algolytics.streaming.clickad.Constants
matching the target to the trigger is based on specific fields in the JSON; for a trigger, if userId is null, the target will still be found based on the other fields
target_start - the time in ms after which the target window starts
target_length – window length in ms
target_length_exact - a negative value means that the target window is counted until the end of the data (but cannot be shorter than target_length)
trigger_validator - a Java expression that must return true after the trigger occurs in order for the row to be written to the training table. If null, there is no additional restriction on the written rows. The validator can use aggregates with any type of window (WINDOW_TARGET and WINDOW_GLOBAL are treated the same)
saved_state_time - the time in ms up to which the state of users was counted after building the model; set by the application
first_model_time - time in ms of the first saving of the scoring code
saved_model_time - time in ms of the last saving of the scoring code
value - a value associated with the model, e.g. CPC (cost per click) in the case of RTB. If a score (and not a value) is returned in response to a request, then value should be 1
client_value - a value related to the model, e.g. the client's CPC in the case of RTB; null if not needed
positive_target_cnt - in the case of RTB, the ordered number of positive targets (e.g. clicks) to be generated in the campaign; null if not needed
weight - weight; in the case of RTB the bid price is: value * score * weight
positive_target_ratio - the ratio of the number of rows with a positive target to all rows in the table used for modeling; should be set to 0 until it has been counted
category_id - campaign category
use_category - 1 - model built per campaign category, 0 - model built per campaign
end_date - the time in ms of the end of the campaign (in the case of RTB, the time by which the ordered number of positive targets, positive_target_cnt, is to be generated)
model_aggregates (table needed because each model can use multiple aggregates and each aggregate can be used in multiple models):
model_id INTEGER - id of the model from the models table
aggregate_id INTEGER - id of the aggregate from the aggregates table (there can be many for a given model_id)
used - a positive value means that the variable is passed to the scoring code, a negative value means that the variable will be used to build the model
trigger_aggregates (defines the trigger arguments; this table is needed because a given trigger variable can use multiple aggregates, and each aggregate can be used by multiple trigger variables):
trigger_id INTEGER - id of the trigger from the triggers table
aggregate_id INTEGER - id of the aggregate from the aggregates table that is an argument (there can be many for a given trigger_id)
model_triggers (defines triggers for models; this table is needed because a given trigger variable can be used by multiple models, and each model can use multiple triggers):
model_id INTEGER - id of the model from the models table
trigger_id INTEGER - id of the trigger from the triggers table (there can be many for a given model_id)
dictionary (dictionary that maps values from JSON to values passed to the aggregate):
id - the id of a group of values (they should not be repeated for the same id) or ranges (they should be disjoint for the same id)
categorical - if true, value is taken as input; if false, start, start_inclusive, end and end_inclusive are taken as input
value - a specific value from JSON
start - the beginning of the numeric value interval
start_inclusive - whether the start of the interval is open or closed
end - the end of the numeric value interval
end_inclusive - whether the end of the interval is open or closed
mapped_value - the value passed to the aggregate
default_model:
default parameters used when automatically adding new models from the API level
client_id - id of the client
target_definition - definition of the target derived aggregate (when loading and creating a new model, the string ${model_id} will be replaced with the model_id of the new model)
target_aggregates - aggregates needed to count the target (aggregate names separated by commas, e.g.: 'agg1, agg2, agg3')
trigger_definition - trigger definition (when loading and creating a new model, the string ${model_id} will be replaced with the model_id of the new model)
trigger_aggregates - aggregates needed to count the trigger (aggregate names separated by commas, e.g.: 'agg1, agg2, agg3')
trigger_validator_definition - definition of the trigger validator (when loading and creating a new model, the string ${model_id} will be replaced with the model_id of the new model)
trigger_validator_aggregates - aggregates needed to count the trigger validator (aggregate names separated by commas, e.g.: 'agg1, agg2, agg3')
target_type - corresponds to target_type in the models table
target_start - corresponds to target_start in the models table
target_length - corresponds to target_length in the models table
target_length_exact - corresponds to target_length_exact in the models table
target_aggregate_name - name of the target aggregate (when loading and creating a new model, the string ${model_id} will be replaced with the model_id of the new model)
weight - corresponds to weight in the models table
model_urls:
URLs to be included/excluded when building the model
model_id - id of the model
client_id - id of the client
url - URL of the website (domain)
included - if 1, the page should be included in the modeling; if 0, excluded
The types of variables, aggregates, and types of available aggregation windows are defined in the DBConstants file:
Types of variables: numerical, categorical (the VARIABLE_... constants)
Types of windows:
WINDOW_TARGET – a specific window type to define a target variable for the predictive model training process
WINDOW_GLOBAL – The window includes all the data history saved in the tool
WINDOW_TIME – The window aggregates the data in a window specified by time (given in ms). The length of the window is given in window_size. The window is of the "tumbling window" type
WINDOW_TIME_SLIDE - The window aggregates data in a window defined by time (given in ms) and offset by the time specified in the parameter window_shift (in ms). The length of the window is given in window_size. Sliding window
WINDOW_COUNT – window defined as an aggregate from window_size events
WINDOW_COUNT_SLIDE - window defined as an aggregate of window_size events moved back by window_shift events. Sliding window
WINDOW_CURRENT_TIME – a window of length window_size aggregated in real time
For the above window types, the window_lag parameter allows the window to be moved back by window_lag from the current moment.
Tumbling window – events are summarized in fixed-size, non-overlapping time windows. The value changes when the window is closed
Sliding window – events are summarized in fixed-size time windows, but the windows overlap each other, which gives more frequent updates of the value than a tumbling window
Real time window – aggregate values are calculated in real time
Please note that real-time window calculation is computationally expensive, so avoid using it where it is not necessary (only if the application requires a real-time aggregate).
Types of aggregates:
High-level events
Events that trigger scoring / events that trigger target counting
Defined in metadata in the form of expressions in Java (triggers table)
Example: "
(aggr1 == 5 && aggr2 == 8) || (aggr1 < 4 && aggr2 == 1)
"
Definition of target
Target is defined as an aggregate in the aggregates table with a special window type window_type = 0. Then the aggregate id should be entered in the models table
An example of the aggregates table – aggregate count as target:
id, variable_id, aggregate_type, window_type, name, …
1, 1, 1, 0, D_exists_all
Example form of the models table:
id, used, target_aggregate_id, target_start, target_length, target_length_exact, …
1, 1, 1, 0, 0, 0
In the example above, the target window is counted from the next event after the event setting the trigger to true (target_start = 0) and is counted until the end of the data (target_length = 0 and target_length_exact = 0)
Example:
The target is the aggregate D_exists_all taking the value 1 if the event "D" occurred in the given target window and 0 if it did not occur
If target_start = 0, then the target window is counted from the next event after the trigger event is true. Otherwise, the target window starts counting target_start milliseconds after the trigger event occurs
The target window has a length of target_length in milliseconds. If target_length = 0, then all messages from the trigger occurrence to the end of the data are taken into account to calculate the target value
If target_length > 0 and target_length_exact = 1, it means that if there is no data for the entire window length period, the target will not be counted and the row for the given user will not appear in the resulting table with aggregates
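The rules above can be sketched as follows (illustrative only, not the engine's code); triggerTime and lastEventTime are millisecond timestamps taken from the event stream:

```java
// How target_start, target_length and target_length_exact control the target window.
static boolean targetWindowComplete(long triggerTime, long lastEventTime,
                                    long targetStart, long targetLength, int targetLengthExact) {
    long windowStart = triggerTime + targetStart;   // targetStart = 0: window starts right after the trigger event
    if (targetLength == 0) {
        return true;                                // window runs until the end of the data
    }
    long windowEnd = windowStart + targetLength;
    if (targetLengthExact == 1 && lastEventTime < windowEnd) {
        return false;                               // not enough history: no row is written for this user
    }
    return true;
}
```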
Converting Events to Variables
Defined in the metadata in the form of JSONPath rules (variables table). JSONPath is a notation/library that allows searching within json documents
An example is given in point Messages
Derived variables
It is possible to define derived variables. The definition of such a variable takes the form of a Java class.
In the variables table, define a variable of the type: definition_type = TRANSFORMED
In the definition field, enter the full definition of the Java class, which must inherit from DoubleTransformation or StringTransformation (there should be no import of this class). There should be no specified package in the class being defined.
For a variable of type TRANSFORMED, enter in the input_id field the id of the input variable passed to the class defined in definition. This id must point to a variable of type JSON_PATH.
Example1 – a variable returning a domain from a url:
Input variable:
id = 1
definition_type = JSON_PATH
definition = $.['url']
Derived variable
id = 2
input_id = 1
definition_type = TRANSFORMED
definition:
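The class body itself is not reproduced in this document; below is a hypothetical sketch only. The transform(...) method name and its String-in/String-out signature are assumptions about the StringTransformation contract; per the rules above, the class declares no package and does not import StringTransformation.

```java
// Hypothetical sketch – the transform(...) method name and signature are assumptions.
public class UrlToDomain extends StringTransformation {
    public String transform(String url) {
        if (url == null) {
            return null;                 // null input -> the variable's default_value is used
        }
        try {
            String host = new java.net.URI(url).getHost();
            return host != null ? host : url;
        } catch (java.net.URISyntaxException e) {
            return null;                 // unparsable url -> fall back to default_value
        }
    }
}
```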
Example2 – a variable returning the day of the week based on the time in ms, by default the time field in json. The name of the field in json denoting time can be configured in config (jsonTimeName field):
Input variable:
id = 1
definition_type = JSON_PATH
definition = $.['time']
Derived variable
id = 2
input_id = 1
definition_type = TRANSFORMED
definition:
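As with Example1, the class body is not reproduced here; the sketch below is hypothetical. The transform(...) method name and the assumption that the millisecond timestamp is delivered as text are not confirmed by this document.

```java
// Hypothetical sketch – method name and input type are assumptions.
public class TimeToDayOfWeek extends StringTransformation {
    public String transform(String timeMs) {
        if (timeMs == null) {
            return null;                 // null input -> the variable's default_value is used
        }
        long ms = (long) Double.parseDouble(timeMs);
        return java.time.Instant.ofEpochMilli(ms)
                .atZone(java.time.ZoneOffset.UTC)
                .getDayOfWeek()
                .toString();             // e.g. "MONDAY"
    }
}
```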
State storage for off-line users
For users who are currently off-line, aggregate values are stored.
Database for storing state: MongoDB
Off-line event processing and modeling
The off-line aggregate counting and modeling process can be repeated at pre-set intervals (triggered automatically by offline scheduling or manually on demand)
For each model, an analytical table is created containing 1 row for each user for whom the trigger (the condition that triggers the scoring of a given model) has been met and the target has been calculated. Note: in the case of programmatic, multiple rows can be created for each user, because input jsons (bid requests) can actually contain several bid requests for different impressions. Then as many rows are created for the user as there are impression ids.
For each analytical table, an ABM calculation process is run, and the finished model (scoring code) and information about the variables used (model signature) are saved to the engine metadata
The aggregate table is the input for the ABM process, which automatically selects variables and calculates the optimal model
The method that invokes off-line processing allows the table with aggregates to be saved for further analysis, or manual modeling by an analyst. In the call, specify the target alias for gdbase and the name under which the table should be saved
Models
Model information is stored in the metadata in the models table
Models used for on-line scoring have the used = 1 flag and the active = 1 flag set. If the model has not yet been built, but is active, then it only has the active = 1 flag
The models table stores the scoring code as a string
Deployment of the new model
The new model after recalculation is automatically implemented - the new model overwrites the old model with the same id
If the model uses different variables than the previous one, all the necessary aggregates have to be recalculated backwards to have their current state. Aggregates are added iteratively to ensure the lowest possible latency of processing the first message with the new model: first, the necessary aggregates are counted on the stored raw messages, but in the meantime new messages may have arrived, so the next iteration updates the aggregate values with these additional messages. The process is repeated several times (see the sketch below)
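An illustrative sketch of that iterative catch-up (the helper method is hypothetical, not the engine API): each pass aggregates only the raw messages that arrived while the previous pass was running, so switching to the new model adds minimal latency.

```java
public class AggregateBackfill {
    static void backfillAggregates(int iterations) {
        long processedUpTo = 0L;                       // timestamp (ms) of the last message already aggregated
        for (int i = 0; i < iterations; i++) {
            long now = System.currentTimeMillis();
            recountAggregatesBetween(processedUpTo, now);
            processedUpTo = now;                       // the next pass only covers messages that arrived meanwhile
        }
    }

    // Hypothetical helper: reads raw messages with timestamps in (fromMs, toMs] from the event
    // repository and updates the aggregate state required by the new model.
    static void recountAggregatesBetween(long fromMs, long toMs) {
        /* placeholder */
    }
}
```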
Scoring
Scoring is triggered by the occurrence of a high-level event. Events are defined in the metadata (triggers table)
Different events can trigger scoring with different models (assigning high-level events to models in the meta)
Different groups of users can be scored with different models. User group definitions are in the form of expressions defined in the same way as scoring triggers (also in the triggers table)
A given user can be scored by multiple models at once (i.e. one event triggering scoring can be assigned to multiple models: table model_triggers)
The scoring code is stored as a string in the models table
In the case of programmatic, the modeling table is built at the level of the unique user id and the impression id. So the scoring is also at the level of impressions. Events that trigger scoring (bid requests) can actually contain several bid requests (a list of impressions). In this case, at the beginning of processing, the event is divided into several events (one for each impression) and only those events are scored
Dictionaries
In order to define aggregates more easily, dictionaries can be used. Dictionaries can be defined in the metadata in the dictionary table. Then, when defining the aggregate, the dictionary_id of the appropriate entry in the dictionary should be provided.
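An illustrative sketch (not the engine's code) of how a dictionary row maps an input value to the value passed to the aggregate: DictionaryRow mirrors the dictionary table columns described in the metadata section, and the lookup distinguishes categorical entries (exact value match) from numeric ranges. The record and inputs are hypothetical examples.

```java
import java.util.List;

public class DictionaryLookup {
    record DictionaryRow(boolean categorical, String value,
                         double start, boolean startInclusive,
                         double end, boolean endInclusive,
                         String mappedValue) {}

    static String mapValue(List<DictionaryRow> rows, String categoricalInput, double numericInput) {
        for (DictionaryRow r : rows) {
            if (r.categorical()) {
                if (r.value().equals(categoricalInput)) {
                    return r.mappedValue();            // exact value match
                }
            } else {
                boolean afterStart = r.startInclusive() ? numericInput >= r.start() : numericInput > r.start();
                boolean beforeEnd  = r.endInclusive()   ? numericInput <= r.end()   : numericInput < r.end();
                if (afterStart && beforeEnd) {
                    return r.mappedValue();            // value falls into the defined range
                }
            }
        }
        return null;                                   // no dictionary entry matches
    }
}
```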
Example of use:
Variable: variable1 takes the values 1, 2, 3, 4, 5 and has id 1 in the variables table
Dictionary:
We want to count the following aggregates:
Form of the dictionary table:
Form of the aggregates table (selected columns):
App Description
Components:
Core - processes messages and returns a score. It requires the JDK (the JRE is not enough). The application allows running multiple Core processes, which makes it possible to handle higher traffic volumes and also keeps garbage collector pauses short: a single process then takes up less memory, so a garbage collection run finishes faster. With multiple Core processes, each one processes a certain pool of users resulting from the partitions created on Kafka; partitioning is by the hash of the user's id.
HTTPServer - receives queries, sends them to the Core server, receives the result and returns it to the user. A detailed description of the supported queries can be found in points 17 and 18.
Metadata - a gdbase server is needed, with the user data tables created and populated. Table definitions are in the metadata.sql file (in the tool's sources).
MongoDB - MongoDB database is needed.
Kafka - Kafka is needed, with 2 topics created (the topic names are set in the configuration files, keys kafka_request_topic and kafka_response_topic). Authorization must be disabled in Kafka. If you run multiple Core processes, you need to create the kafka_request_topic with a number of partitions equal to the number of processes (e.g.: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 4 --topic request) and set kafka_request_topic_partitions to this value before running them.
InfluxDB and Grafana – used to collect and visualize statistics on-line. In influx, information about incoming events is collected, e.g. the number of events of various types, the number of scorings, processing times, etc.
Configuration: The application can be configured with settings in the config.properties files (separate for Core and for HTTPServer). The file is loaded from the current directory. Keys starting with kafka_consumer_ are passed to the Kafka consumer; the key passed to Kafka is the part of the key in config.properties after removing the kafka_consumer_ prefix. In the same way, keys starting with kafka_producer_ are passed to the Kafka producer. By using these prefixes, you can set any keys for Kafka, not just those that are in the provided config.properties.example (see the sketch below).
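A small sketch of that prefix convention: every config.properties key starting with kafka_consumer_ is forwarded to the Kafka consumer with the prefix stripped (kafka_producer_ keys are handled the same way for the producer). The helper name is illustrative, not the application's code.

```java
import java.util.Properties;

public class KafkaConfigPrefix {
    static Properties extractKafkaProperties(Properties config, String prefix) {
        Properties kafkaProps = new Properties();
        for (String key : config.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                // strip the prefix and forward the remainder as a Kafka property
                kafkaProps.setProperty(key.substring(prefix.length()), config.getProperty(key));
            }
        }
        return kafkaProps;
    }
    // e.g. extractKafkaProperties(config, "kafka_consumer_") turns kafka_consumer_max.poll.records
    // into the consumer property max.poll.records
}
```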
Component commissioning
Running the install.sh file (from the deploy directory) will cause all components to start at boot. Commands such as service gdbase restart will also become available. All components must be installed first. If the components are installed in directories other than those used by the *.service files, you need to change the paths in these files. Default paths:
Starting the model build:
Building a model (i.e. off-line processing) should be run from the directory in which the streaming_core.jar is
It should be started with the appropriate parameters:
nice -n 19 ionice -c 3 java -cp streaming_core.jar com.algolytics.streaming.Offline ...
A list of all parameters (and which of them are required) will be displayed on the console or in the logs (default: Log/offline.log). The nice + ionice parameters ensure that offline processing will not burden online processing.
If ABM is run locally (by specifying the abmScript option), then:
before running you need to set the LD_LIBRARY_PATH /LIBPATH/PATH analogously to what is done in the AdvancedMiner launcher scripts
you need to configure the AdvancedMiner, e.g. it is worth setting a higher value MAX_SCRIPT_EXECUTOR_HEAP_SIZE
Offline call parameters:
modelIds – list of model ids
startTime – time in ms from when to count the table to be modeled
endTime – time in ms until when to count the table to be modeled
startDelay – offset in ms from when to count (current time – startDelay)
endDelay – offset in ms to when to count (current time – endDelay)
stateStartTime – time in ms from when to calculate the state
stateStartDelay – time in ms until when to count the state
copyURL – alias to gdbase, if given then the resulting table will be copied there
copyTablePrefix – prefix for the table name if it is to be copied to gdbase
copyUser – user for gdbase
copyPassword – password for gdbase
processMethod – method type (approximation, gold, quick, advanced)
positiveTargetValue – value denoting a positive target
negativeTargetValue – a value denoting a negative target
qualityMeasureName – quality measure (as in ABM)
cutoff – cutoff threshold for score (as in ABM)
samplingMode – type of sampling (as in ABM)
samplingSize – sample size (as in ABM)
samplingStratificationMode – stratification type (as in ABM)
samplingPositiveTargetCategoryRatio – percentage of positive target at stratification (as in ABM)
classificationThreshold – threshold (as in ABM)
classificationThresholdType – threshold type (as in ABM)
profitMatrixOptimized
profitMatrixCurrency
profitMatrixTruePositive
profitMatrixFalseNegative
profitMatrixFalsePositive
profitMatrixTrueNegative
useTestData
threadCount – the number of threads
abmScript – path to the ABM script
abmAuthToken – token (when calculating ABM web)
userType – user type
stateUserType
minROCArea – min ROC for the model to be implemented
minimumPositiveTargets – the minimum number of positive targets to build a model (programmatic)
useModelUrls – if true – filter Urls based on model_urls (programmatic)
triggerEventInTarget – if true – include the event that triggered the trigger to count the target window
Detailed recommendations for the installation and configuration of individual application components can be found in the sources (deploy directory).
API
Send json with event
Example file with a package of messages from the client (test.json):
Sample customer inquiry:
Sample answer:
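The original sample request and response are not reproduced in this document. As a rough sketch only, a client could send an event package like the one below; the endpoint path and the field values are assumptions, and the mandatory fields follow the parameter table below (appid, time, eventId, userId).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SendEventExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical package with one event; real payloads may carry many events per packet.
        String events = "[{\"appid\":1,\"time\":1500000000000,\"eventId\":\"eventType1\",\"userId\":\"user-1\"}]";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://engine.example.com/event"))   // assumed address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(events))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());   // e.g. a JSON with userId and a scores list
    }
}
```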
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
time
Time in ms (field name can be changed in the configuration file, field: jsonTimeName)
YES
eventId
Event type (the field name can be changed in the configuration file, jsonEventIdName field)
YES
userId
User ID (the field name can be changed in the configuration file, jsonUserIdName field)
YES
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
200
JSON with an "error" field, e.g.:
{"error": "Error during json parsing"}
Error during processing (e.g. during parsing)
Query for the current profile (list of aggregates) of the selected user
Sample customer inquiry:
Sample answer:
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
userid
User ID
YES
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
400 Bad request
No userid
API for adding models
Activating the API for adding models is done by setting the variable
enableModelsApi = true in the configuration file (for Core applications)
Adding a model
Sample customer inquiry:
Sample answer:
{}
– no error
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
modelid
Model ID
YES
value
In programmatic: CPC
YES
client_value
In programmatic customer CPC
NO
use_category
If 1 – the model will be built on categories (then category_id must be provided), if 0 – the model on campaigns
YES
category_id
Model category ID (in the case of programmatic, this is the campaign category)
YES if use_category = 1
positive_target_cnt
In programmatic – the number of clicks ordered
NO
excluded or included
A list of urls to exclude from modeling (if excluded) or to take into account in modeling (if included). Only one of the fields, included or excluded, can be present.
NO
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
404 Not Found
No modelid
400 Bad request
Wrong modelid format
200
{"error": "Parameter use_category must be set and must be an integer (0 or 1)"}
Not set or wrong format use_category
200
{"error": "Parameter value must be set and must be numeric"}
Not set or wrong value format
200
{"error": "Parameter category_id must be provided if use_category = 1"}
Not set category_id and use_category = 1
200
{"error": "Exactly one parameter must be set (either included or excluded)"}
None or both fields provided: excluded and included
Modifying Model Parameters
The same query as when adding a model, but in addition to the mandatory parameters, only the modified ones should be specified.
Sample customer inquiry:
Sample answer:
{}
– no error
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
modelid
Model ID
YES
value
In programmatic: CPC
YES
client_value
In programmatic customer CPC
NO
use_category
If 1 – the model will be built on categories (then category_id must be provided), if 0 – the model on campaigns
YES
category_id
Model category ID (in the case of programmatic, this is the campaign category)
YES if use_category = 1
positive_target_cnt
In programmatic – the number of clicks ordered
NO
excluded or included
A list of urls to exclude from modeling (if excluded) or to take into account in modeling (if included). Only one of the fields, included or excluded, can be present.
NO
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
404 Not Found
No modelid
400 Bad request
Wrong modelid format
200
{"error": "Parameter use_category must be set and must be an integer (0 or 1)"}
Not set or wrong format use_category
200
{"error": "Parameter value must be set and must be numeric"}
Not set or wrong value format
200
{"error": "Parameter category_id must be provided if use_category = 1"}
Not set category_id and use_category = 1
200
{"error": "Exactly one parameter must be set (either included or excluded)"}
None or both fields provided: excluded and included
Deactivate a model
Sample customer inquiry:
Sample answer:
{}
– no error
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
modelid
Model ID
YES
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
404 Not Found
No modelid
400 Bad request
Wrong modelid format
200
{"error": "model 111 does not exist."}
There is no model with the given id
Adding Urls for excluding/including during modeling (for programmatic)
Sample customer inquiry:
Sample answer:
{}
– no error
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
modelid
Model ID
YES
excluded or included
A list of urls to exclude from modeling (if excluded) or to take into account in modeling (if included). Only one of the fields, included or excluded, can be present.
YES
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
404 Not Found
No modelid
400 Bad request
Wrong modelid format
200
{"error": "Exactly one parameter must be set (either included or excluded)"}
None or both fields provided: excluded and included
Retrieve information about the selected model
Sample customer inquiry:
Sample answer:
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
modelid
Model ID
NO – then all models will be returned
Error codes:
Name
Output JSON
Reason
403 Forbidden
No appid
400 Bad request
Wrong modelid format
200
{}
There is no such model
Retrieve information about all models
Sample customer inquiry:
Sample answer:
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
Error codes:
Name
Output JSON
Reason
200
[]
There are no models for such an appid
Retrieving information about the list of used/unused, active models
Sample customer inquiry:
Sample answer:
Parameters:
Name
Description
Is it mandatory?
appid
Customer ID
YES
used
1 – list of models used in on-line scoring (active=1 and used=1 in meta data)
0 – list of active models, but with used = 0 (e.g. the model has not been built yet)
Error codes:
Name
Output JSON
Plaintiff
200
[]
There are no models for such an appid and set value of used
Scheduling offline processing
It is possible to schedule the process of building models
Offline scheduling is activated by setting the enableOfflineScheduler = true variable in the configuration file (for the Core application)
Scheduling is done by calling the API request to add models (point API for adding models).
The first time you add a model, the build job is scheduled for the next day. If the build is successful and the model is deployed (a sufficient number of positive targets, adequate model quality based on ROC), the next build is scheduled a week later.
Currently, offline processes run serially (this is set in the code in the configuration of the Quartz library used for scheduling)
Reloading metadata while the app is running
While the application is running, you can manually reload the metadata by calling the following command from the Core directory:
You can also modify the parameters of a single model (including the weight parameter that affects the calculated bid price in the programmatic case):
Visualization - EVE Metrics
Requirements: InfluxDB and Grafana (or Power BI) are needed for the metrics to work.
How it works: Metrics are sent by the application (Core) to the influxDB database or to PowerBI, and then visualized by Grafana (or PowerBI). Statistics are counted for all requests that are processed by the application within a certain period of time. In Grafana, you need to define the influx as the DataSource from which the metrics will be retrieved. The app doesn't connect directly to Grafana.
Configuration: The influx parameters are defined in the configuration file (for the Core application) along with other parameters for metrics:
metrics_destination - INFLUX_DB if statistics are to be sent to influx POWER_BI if to Power BI, NONE if statistics counting is to be disabled
influx_db_user - username
influx_db_password - password
influx_db_database - database name
custom_request_fields - fields from the incoming event to the engine by which metric values are to be aggregated
metric_processed_times – processing-time thresholds in ms; for each threshold, the number of messages processed within that many milliseconds is counted (see the processed_time_... fields below).
metric_time_window - every so many seconds, the metrics are recalculated for the collected events
aggregate_time_window - used by MeanScoreMetric, calculates the average score value in a time window. Defined in seconds, it should not be less than metric_time_window.
max_metric_calculation_threads - The maximum number of threads to compute metrics. The number of threads is determined by the number of metrics defined, but it cannot be greater than the maximum number of threads.
event_request_metrics - metric names for events of the EVENT type, e.g. ScoreMetric; ProcessedRequestsMetric; WinPrcMetric; BidPrcMetric
profile_request_metrics - metric names for events of the PROFILE type
Available metrics:
ProcessedRequestsMetric Presents the number of processed requests in a given metric_time_window, along with the number of incorrect requests and those for which scoring was performed. The collected statistics are aggregated per clientId and per the user-defined fields in the custom_request_fields configuration. Fields in JSON sent to Grafana (influx name: processed_requests):
processed (number)
processed_time_[number from config] (number)
scored (number)
errors (number)
min_time (number)
max_time (number)
mean_time (number)
sum_time (number)
MeanScoreMetric The metric averages the score for each clientId and modelId, and user-defined fields in the custom_request_fields configuration. The time window size is configured by the aggregate_time_window with an offset every metric_time_window. In the programmatic version, the suggested bidding price is returned instead of the score. Fields in json sent to Grafana (influx name: score):
min_score (number)
max_score (number)
mean_score (number)
sum_score (number)
scores_count (number)
modelId (text)
WinPrcMetric Metric used only in the programmatic version, regarding the price per won impression. Fields in json sent to Grafana (influx name: win_prc):
min_win_prc (number)
max_win_prc (number)
mean_win_prc (number)
sum_win_prc (number)
count_win_prc (number)
BidPrcMetric Metric used only in the programmatic version, concerning the bid price (taken from the bid response). Fields in json sent to Grafana (influx name: win_prc):
min_bid_prc (number)
max_bid_prc (number)
mean_bid_prc (number)
sum_bid_prc (number)
count_bid_prc (number)