Getting Started with logs and monitoring in Splunk
Splunk is a log collection and management system that receives log streams from SCALE and stores logs for searching and analysis, while also parsing certain logs for key metrics that are then tracked in pre-constructed dashboards. The dashboards and the searchable logs are grouped by application codes (appCode) consistent with application stacks as provisioned and managed in SCALE. Anyone with access to the development spaces or automation environments for a given application (appCode) can access the Splunk dashboards and logs for that application; depending on the specific Active Directory groups a user is assigned to, a user can also create their own custom dashboards.
In order to access Splunk, open a browser to: https://splunk.corp.netapp.com and log in using normal Active Directory credentials.
Once logged in, make sure to be in the correct application context: NGDC, selected from the App: drop-down list in the upper left corner of the browser window. A good starting point is a dashboard called Application detail in SCALE, which can be found in the list of dashboards by clicking the Dashboards link. A more direct way to open and start from this dashboard is to point your browser directly to this link: https://splunk.corp.netapp.com/es-US/app/ngdc/details_of_an_application_in_scale
From the Application detail in SCALE dashboard, select the appropriate application code (appCode) from the dropdown labeled AppCode, and multiple sections of the dashboard will populate with information for the selected application. There is a lot of information on this dashboard, so it will be necessary to scroll down to find it all. Most items are self-explanatory: the top set of dashboard elements shows general usage of SCALE infrastructure by the selected application; a middle section shows capacity usage trends for the application's workspace, hostspaces and dataspaces; this is followed by recent errors or warnings from OpenShift events for the application. The lower portion of the dashboard then lists the different types of log streams Splunk is collecting for the application, each of which can be clicked to begin a search for specific log entries or to aggregate and summarize the results in different ways.
Directing logs to the correct log buckets
Logs collected into Splunk are grouped based on 3 attributes:
| Attribute | Description | Assigned from... |
|---|---|---|
| index | Application | appCode with "scale-" prepended |
| sourcetype | Type of log | Set in Kubernetes annotations based on type of application stack |
| openshift_namespace | Openshift namespace | Openshift namespace |
Most sourcetype assignments are predetermined and are independent of the applications deployed into SCALE, or are simply defined outside the scope of an application's deployment. The several openshift_-prefixed source types are preset in this way. The more application-specific source types are set according to the application stack from the product catalog, as managed within the Helm charts for each application. Similarly, the index is determined by the value of the appCode as it is assigned, in the Helm chart, to the app.code YAML value.
For example, to ensure that the logs for an application under a 3-letter appCode of xyz are directed to the correct index, make sure that app.code is set within the application's values.yaml file, under its Helm charts, as follows:
```yaml
app:
  code: xyz
```
Special case of log formats for CaaS applications
One special case regarding the setting of the sourcetype is the CaaS (container-as-a-service) application stack. To send the logs for the container's application to the correct source type (sourcetype), the Kubernetes deployment object, as defined in the templates/deployment.yaml file of the application's Helm chart, should set the correct source type in the value assigned to the collectord.io/logs-type annotation. This could be a pre-defined log format delivered with the Splunk system (referred to as "pretrained" source types) or a new one to be defined for the CaaS application.
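As an illustration, a minimal sketch of such a deployment template is shown below. The deployment and container names, the image, and the choice of the pretrained nginx source type are placeholders only; the annotation placement should match the conventions of the application's own chart:

```yaml
# Minimal sketch only -- names, image and the "nginx" source type
# are illustrative placeholders, not values from a real chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xyz-web
spec:
  selector:
    matchLabels:
      app: xyz-web
  template:
    metadata:
      labels:
        app: xyz-web
      annotations:
        # Direct this pod's container logs to the desired Splunk source type
        collectord.io/logs-type: "nginx"
    spec:
      containers:
        - name: xyz-web
          image: registry.example.com/xyz-web:latest
```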
For a list of existing pretrained source types, please see the Splunk documentation at this link: Splunk Pretrained Source Types.
If it should be necessary to add a new source type in order to parse a new log format for a CaaS application, see the Splunk documentation at this link for instructions: Splunk Manage Source Types.
Note about applying changes to Kubernetes annotations (labels)
Please note: When making changes to any of the annotations for a Kubernetes deployment, subsequent deployments can sometimes fail because Kubernetes does not always handle annotation changes well. If an attempted redeployment fails with a "the job spec is invalid...the field is immutable" error, manually remove the old deployments and allow the CI/CD pipeline to redeploy.
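For example (a hedged sketch; the deployment/job names and namespace below are placeholders for your application's own objects), the stale objects can be removed with the OpenShift CLI:

```shell
# Placeholders: substitute your application's deployment/job names
# and OpenShift namespace.
oc delete deployment xyz-web -n xyz-prd-1

# If a generated job raised the "field is immutable" error,
# remove that job as well:
oc delete job xyz-web-deploy-job -n xyz-prd-1

# Then trigger the CI/CD pipeline to redeploy from scratch.
```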
Searching Logs in Splunk
The most basic function of Splunk is to facilitate searches of logs. Search results can be viewed in raw form or saved afterwards as Reports (to repeat the search) or as Panels (components) of Dashboards offering Visualizations of the results (in the form of graphs, charts and other available visualization types).
All logs have two or three important fields to filter on to narrow the search to the relevant log events to be reviewed: The index field, the sourcetype field and usually also the openshift_namespace field. As described above, the index field describes a specific application, composed of the scale- prefix followed by the 3-letter appCode. The openshift_namespace field, as the name implies, represents the OpenShift namespace, which essentially differentiates the environment for the application (e.g. test environments versus production). The sourcetype field separates the different types (and formats) of the logs generated by an application, either because they are generated by different application components (e.g. a web UI versus a Springboot or Python backend) or by different elements of the application technology stack (e.g. application versus a data layer like MongoDB or MariaDB or components of the Kubernetes platform like OpenShift).
The Search screen is reached by clicking on the Search label at the upper left corner of the Splunk screen, but can also be reached by clicking one of the data sources listed (as described earlier) in the Application detail in SCALE dashboard. When entering the search screen in this manner, some search terms representing that selected data source (the index, sourcetype and openshift_namespace fields) will already be pre-populated to narrow the search.
Search terms to filter the log data returned are defined as a series of field=value pairs separated by spaces. So an example of a search term for an application with a 3-letter appCode of xyz with a backend web service might look like the following:
```
index=scale-xyz sourcetype=access_extn openshift_namespace=xyz-prd-1
```
Independent of the search term, a period of time over which to search the logs is set in the upper right corner of the Search screen in the form of a dropdown list that opens to a set of options to select from common time periods called Presets (e.g Today, Last 24 hours, Last 3 days, Previous week, etc.) as well as to define more granular time periods by expanding on one of the alternative time selection methods: Relative, Date Range, Date & Time Range or Advanced.
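Alternatively, the time range can be set directly in the search string using Splunk's earliest and latest time modifiers, which take precedence over the time picker. For example, to search the last 24 hours:

```
index=scale-xyz sourcetype=access_extn openshift_namespace=xyz-prd-1 earliest=-24h latest=now
```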
Once the search terms are set and the time period is defined, the search is started by clicking on the green search icon (a magnifying glass). Splunk searches are jobs that run in the background and return data once completed. In some cases, a search may be blocked because of job quota limitations for the user ID logged into Splunk. This usually happens because Splunk preserves past searches for a period of time, but there is a limit to how many searches can be preserved at once for a given user. If a message is produced indicating this problem, click on the link to go to the Job Manager, where past searches can be deleted to free up quota and allow the new search to proceed.
The values assigned to fields in the search terms can also use wildcards. As an example, the above search term can be modified as follows to select log entries from all namespaces:
```
index=scale-xyz sourcetype=access_extn openshift_namespace=xyz-*
```
Another powerful capability of Splunk searches is the ability to aggregate the results by piping them through statistical functions. One common function is stats, which can be used, for example, to count the events by a given field value. An example of this, building on the above search term, would look like the following:
```
index=scale-xyz sourcetype=access_extn openshift_namespace=xyz-* | stats count by openshift_namespace
```
Once this search is run, the results will be a list of OpenShift namespace (the openshift_namespace field) values with a count of the events for each value. By clicking on one of these field values, View events can then be selected to pull up the events matching that particular value (i.e. the field with that selected value will be added to the search term) in order to drill down to a more refined set of events.
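As a further illustration (building on the same example search), aggregations can also be bucketed over time with the timechart command, which is particularly useful as the basis for dashboard visualizations:

```
index=scale-xyz sourcetype=access_extn openshift_namespace=xyz-* | timechart span=1h count by openshift_namespace
```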
The search capabilities of Splunk include many powerful features. To learn more about searching in Splunk, refer to the following link: Splunk Search Manual, or take the search tutorial at Splunk Search Tutorial. Another convenient reference is this PDF file at Splunk Quick Reference Guide.
Creating a New Report
A search can be saved and run repeatedly as a report, either on demand or automatically on a schedule. After conducting a search and being satisfied with the results and effectiveness of the search, click on the Save As dropdown list in the upper right corner of the Search window and select Report as the type of object to save. Then fill in a report title and description. More detailed instructions for creating reports can be found at this link: Splunk Create and Edit Reports.
When creating a new report in the SCALE environment, prefix the name of the report with the 3-letter application code (appCode), followed by a hyphen and a descriptive title (e.g. for a new report for an application with the 3-letter appCode of xyz, the report name might be "xyz - MyApp Transaction Report"). This will ensure easy identification of which reports belong to which application teams. Despite the need for the application code prefix, Splunk does recommend keeping report names relatively short to avoid errors that can sometimes occur with long names.
Creating a New Dashboard
While the Splunk system for SCALE comes with some pre-built dashboards, it may be desirable for an application team to build additional dashboards to meet their own needs, unique to their applications. With appropriate Active Directory group assignments, creating new dashboards or panels in dashboards can be done within Splunk.
A dashboard panel is created by saving a search and creating a visualization from that search. Similar to how a report is made, from a search, click the Save As dropdown, this time select Dashboard Panel as the type of object to save, and fill in the fields. More information about dashboards and visualizations can be found at this link: Splunk Dashboards and Visualizations. An example of the process of creating a dashboard can be followed in the guided tutorial at this link: Splunk Tutorial to Create Dashboard.
Similar to creating reports, when creating a new dashboard in the SCALE environment, provide a prefix of the 3-letter application code (appCode) for the name of the Dashboard followed by a hyphen and a descriptive title (e.g. for a new dashboard for an application with the 3-letter appCode of xyz, the dashboard name might be "xyz - MyApp Transaction Dashboard"). This will ensure easy identification of which dashboards belong to which application teams.