The main purpose of service delivery is to ensure proactive operations and that the ICT services deliver appropriate support for users. Service delivery focuses on your organisation's needs. For a school, that need is active learning with ICT tools in the different subjects. This chapter describes, in order:
- Service level management
- Economy management
- Capacity management
- Capacity planning
- Access control
- Operational continuity
Service level management
Service Level Management is often shortened to SLM; the contract it manages is the service level agreement (SLA). Managing the service level is about the quality of the operational services, measured in relation to what is agreed in a contract. The contract sets concrete figures for availability, response times, support, error correction etc.
The objective is to have control over the service level and improve the quality of the operational services. The quality level is determined, monitored and reported in repeated rounds. The purpose is to improve the contact between ICT administrators and users, so that the ICT service is delivered at the agreed quality.
It is important to understand the different types of SLAs. One can choose from many types of agreements. The three most common types are:
- Agreement per service for all customers
- Agreement per customer for all services
- Agreement per service per customer
All SLAs must be administered, reported and maintained. This can quickly become confusing and produce much work that provides no particular benefit. The purpose is to get an agreement that helps to improve the quality of service, so it is useful to think carefully about this when the agreement is made. Here is an overview of what to make sure of when you create a service level agreement:
- Agreement between the users and operations on what is actually being measured. This must be seen from the users' perspective and not the ICT service's perspective.
- Measurement and clarity about the metrics included in the SLA
- Decide realistic targets for the service level (there is no point in promising more than one can keep)
- Continuous focus on the control of the service - monitoring and periodic reporting of results achieved
It is essential that the operations center has the technical capability to measure the values included in the SLA. This must be taken into account from the beginning.
Furthermore, it is important to identify the services that depend on subcontractors, where one therefore cannot give service guarantees or must rely on a similar agreement with the subcontractor. Dependencies are defined so that it is clear who should rectify a problem, and to avoid never-ending negotiations before the error can be corrected.
The level of service may be different for different user groups, or during different periods of the school year. For example, there may be differences between teachers and students, or a higher service quality during exams. Dialogue with all relevant users is important to ensure that what matters most to each user group is measured.
A service catalogue with all services included in the SLA must be prepared. In this catalogue a service will often be an application (program). There will often be different requirements for different services, and this will be reflected in different objectives in the agreement.
The importance of establishing and continually adjusting the users' expectations can't be overestimated. Users often have exaggerated expectations of the system and the services included. It is the ICT service's responsibility to adjust expectations down to realistic levels before the service-level agreement (SLA) is signed. Operations management must also ensure that all users are actually notified of, and know about, the expected service level in the agreement.
For the structure of the SLA, see the section on the content of the service level agreement.
The operational situation
Monitoring the service levels actually achieved, and reporting back to the customer, is essential to preserve a good relationship between the Service Desk and the users. The format and level of detail for reporting should be dealt with in the SLA.
Periodic meetings with the client must be held, for example quarterly or semiannually. These meetings should result in concrete plans for the next period and, possibly, agreements for the implementation of new services.
Content of the Service Level Agreement (SLA)
Name and contact information for the Contracting Parties, description of the services included, duration of the agreement, responsibilities between the customer and the supplier.
The time periods during which the agreement is valid (for example Monday to Friday, from 8:00 a.m. to 4:00 p.m.), any special requirements at defined dates and times (for example exams), and routines for ordering an extension of the service hours.
Availability of the services. It is best measured as the time during which one or more services have been unavailable in a period, for example a calendar month. Different levels may be agreed for different services, for example depending on their importance to users.
It is important to emphasise that this is availability within the agreed period of service, not overall availability all day, all week and all year round (called 24/7/365). For example, it may be agreed that the system should be available between 8:00 and 18:00 on workdays; after that and on weekends it is less certain that one can use the computer system, unless otherwise agreed.
Availability also means getting support via phone or email: for example, whether the Service Desk can be reached between 08:00 and 16:00, all day, in the afternoons and evenings, or even during specific weekends.
Reliability is often measured as the amount of downtime in a period, or the average time between downtimes. One can also measure the time it takes the system to come up again after downtime.
Support is often measured as response times to user requests by phone (for example 1 minute) or email (for example 30 minutes). When the operator gets a request for support, the message is categorised by severity, with a time guarantee for answers. There may also be an agreement about how quickly error correction will start, which will depend on what kind of inquiry was received.
Support is also about when during the day or night one can reach people. Should support be available during school hours between 08:00 and 16:00, or also throughout the evening or on weekends? Some will also have support on certain holidays.
The period when support is available is usually in the SLA. It is also agreed what support will be available, with a fixed price, and what must be resolved additionally on an assignment basis. The agreement regulates the process of handling enquiries, both what to fix, and when this will happen.
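As a minimal sketch of how response-time targets for support requests could be checked, the example below uses the phone and email figures quoted above. The function name and the dictionary layout are assumptions made for illustration, and the check ignores agreed service hours and severity categories, which a real SLA would define.

```python
from datetime import datetime, timedelta

# Example targets quoted in the text; a real SLA defines its own values.
RESPONSE_TARGETS = {
    "phone": timedelta(minutes=1),
    "email": timedelta(minutes=30),
}


def met_response_target(channel: str, received: datetime, answered: datetime) -> bool:
    """Return True if the request was answered within the target for its channel."""
    if channel not in RESPONSE_TARGETS:
        raise ValueError(f"no response target agreed for channel {channel!r}")
    if answered < received:
        raise ValueError("a request cannot be answered before it is received")
    return answered - received <= RESPONSE_TARGETS[channel]


received = datetime(2024, 1, 8, 9, 0)
print(met_response_target("email", received, datetime(2024, 1, 8, 9, 25)))
print(met_response_target("email", received, datetime(2024, 1, 8, 9, 31)))
```

The first request is answered within 30 minutes and meets the target; the second does not.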
Performance can be measured as the average response time of certain operations in specific applications. This measures the users' experience of the system.
Change handling is measured as the management, approval and implementation times for change requests from the users.
Security can be measured as the number of confirmed security incidents in a period. It is very important to be clear about each user's responsibility, to ensure that the guarantees will apply.
Prices, times for billing and settlement provisions.
Reporting and follow-up
Description of rules and periods for reporting of measured service levels. Regular meetings are recommended, for example quarterly, to go through the report and plan ahead.
Sanctions and possible incentives
Rules for price reduction if the agreed service level is not met. Escalation procedures and rules for cancellation of the agreement after repeated violations of the guaranteed service level. Possible incentives for meeting or exceeding the expected service level.
See Appendix A for SLA.
Organisations rarely have a full overview of their ICT spending. A 2001 survey of Norwegian municipalities showed that only 1 in 8 municipalities had an ICT budget. The situation is probably no better for schools. Putting an ICT budget in place is important. Users often think they pay too much for a service they are not happy with, which often creates conflicts between users and the ICT department.
It is very useful for both the operations center and the users to document the real ICT costs. Without this, it is difficult to budget appropriately. Above all, it is difficult to make a cost/benefit assessment of existing ICT solutions. The rector should know the ICT budget as well as she would know the salary budget, or the budget for teaching aids.
There are three key processes related to financial management of ICT services: budgeting, accounting and billing.
The objective of the budget is to make a realistic estimate of the expected ICT costs. Budgeting usually covers various alternative solutions. The alternatives apply both to equipment and software, and to the level you aspire to. The budget is the starting point for subsequent budget negotiations with the director of education and/or politicians.
The budget must include both personnel and equipment costs. Some organisations only count the cost of buying equipment, omitting as much as 60-70 % in personnel costs for the operation of an ICT solution. One must also count all of the equipment.
There are examples of municipalities forgetting to count the cost of power connectors and computer networks in schools. Then you have forgotten about 2000 NOK (10 NOK = 0.85 GBP/1.18 EUR) per client machine. For 70 new computers, we need about 140,000 NOK for computer networks and power.
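The arithmetic above can be written as a small calculation, sketched here only to make the budget line explicit. The per-client figure is the one quoted in the text; the function name is an assumption made for illustration.

```python
# About 2000 NOK per client machine for network and power, as quoted in the text.
NETWORK_AND_POWER_PER_CLIENT_NOK = 2000


def infrastructure_cost(num_clients: int,
                        per_client_nok: int = NETWORK_AND_POWER_PER_CLIENT_NOK) -> int:
    """Return the network and power cost in NOK for a number of new client machines."""
    if num_clients < 0:
        raise ValueError("num_clients must be non-negative")
    return num_clients * per_client_nok


print(infrastructure_cost(70))  # 70 new computers: 140000 NOK
```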
Alternative solutions are also important to include in the budget. This applies both to operations and to equipment. Today there are several vendors who specialise in operating computer equipment in schools, with varying prices and quality. The number of simultaneous users, and the type of machines and software to be maintained, are important.
If one would like laptops for all teachers and students, the costs will easily be 5-6 times higher than with desktops and three students for each client machine.
Accounting will mainly consist of invoices for purchased equipment, cabling, repair, operations and extra services. When the accounting period is over, it is important to go through the numbers and compare them with the budget.
Planning the accounting and billing
Not all municipalities have accounting systems that show ICT costs detailed by school. There may be practical reasons for that, such as discounts and similar that the municipality gets centrally. It is therefore important to do some planning, so that you have an overview of what operations and procurement cost when the accounts are assessed against the budget.
Some organisations may have cumbersome and costly accounting procedures. You quickly get extra charges if you pay bills late, or there are many who must approve a payment, for instance. So it is important to agree on good billing practices in procurement and operations in order to have control, as well as to handle payments on time without long decision processes.
The payment method is regulated by the SLA. As for the accounting system, one must agree with the finance department on a convenient way to extract reports, so that the necessary overview of ICT costs can be produced without taking a long time.
Contracts usually specify monthly billing consisting of a fixed amount plus any additional services. Billing is done by the accounting office based on the current operations contracts and the extra services performed. It is important to have good and frequent contact with the accounting service about the tasks carried out for the customer.
Capacity planning is used to ensure that all parts of the ICT solution have sufficient capacity to safeguard users' requirements. This includes:
- Monitoring the performance of ICT services and their related infrastructure
- Configuration of the systems to ensure they are optimally utilised for what the users actually do
- Understanding the user needs and planning for possible changes in the systems to take care of future needs
- Resource planning in cooperation with the budget officer
- Preparation of a capacity plan to ensure delivery of operations in accordance with the agreed upon service level
Capacity planning is all about balance:
- Costs against capacity. The budget limits what kind of possible solutions one can implement
- Supply and demand. The systems must have the capacity to handle the demands set by the users
The objective of capacity planning is to avoid surprises.
It is essential for good capacity planning that the systems are continuously monitored to obtain the necessary data.
Typical data that is monitored is:
- Processor utilization
- Memory utilization
- CPU usage per task
- Response time per task for users
- Printer management - the number of prints, queue length, time for print outs
- Storage capacity
- Number of clients
- Number of logins
- Number of simultaneous users
In Debian Edu, Nagios is used as a monitoring tool.
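As an illustration only, the check below flags monitored values that exceed a warning threshold. The metric names and threshold values are assumptions chosen for the example; they are not taken from Nagios or Debian Edu, where thresholds are configured in the monitoring tool itself.

```python
# Hypothetical warning thresholds, in percent of capacity (assumed values).
WARNING_THRESHOLDS = {
    "processor_utilization": 80.0,
    "memory_utilization": 85.0,
    "storage_used": 90.0,
}


def find_warnings(sample: dict, thresholds: dict = WARNING_THRESHOLDS) -> list:
    """Return the sorted names of metrics in `sample` that exceed their threshold."""
    return sorted(
        name for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    )


sample = {"processor_utilization": 92.5, "memory_utilization": 60.0, "storage_used": 91.0}
print(find_warnings(sample))  # ['processor_utilization', 'storage_used']
```

Collecting such samples over time, rather than looking at a single reading, is what makes it possible to plan capacity ahead.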
On the basis of data collected from monitoring routines, one tries to identify any bottlenecks in the systems. Examples:
- Poor or varying utilization of the hardware
- Poorly designed software
- Poor utilization of memory capacity
- Bottlenecks on data storage, memory or processor
- Bottlenecks in the network
If the data analysis uncovers bottlenecks, one needs to try to set up the system in a way that better caters for the users' needs.
Here is a list of commonly encountered bottlenecks and what to do to get rid of them:
- Problem: Sound, USB stick support and DVD are missing on thin clients. Solution: install diskless workstations (> 800 MHz processor, > 256 MB RAM).
- Problem: 60 thin clients are connected to the server and more PCs are wanted. Solution: go for diskless clients, or install another thin client server.
- Problem: Thin clients run slowly after adding 20 more without acquiring a new server machine. Solution: install 2 GB more memory on the server machine.
- Problem: Thin clients with 32 MB memory do not start after upgrading to Skolelinux 2.0. Solution: turn on swapping on the thin clients, or downgrade to LTSP 4.2, which is set up with swap.
- Problem: Flash animations make the thin clients slow when 50 students are logged into the same server machine. Solution: install diskless clients.
Implementation of possible changes to the system configuration must be done in accordance with the guidelines set for changes of the system. A well-planned function and performance test must also be done before changes can be made in the production system. Testing is done to avoid operational disturbances when changes are set into production.
Preparing the capacity plan
A capacity plan is basically an investment plan for the ICT system based on knowledge of the users' current needs and future plans.
The capacity plan should be updated and processed once a year, normally in conjunction with the budget process. The plan should include the following themes:
- Current and future user needs
- Service summary
- Resource summary
- Areas for improvement
- Cost model
Good and stable availability of ICT services is obviously crucial for users.
Availability, seen from the user perspective, depends on the following factors:
- Availability of technical components
- Failure tolerance
- Quality of maintenance and support
- Procedures and routines for processing operational services
- Security, integrity and availability of data
Availability can be measured in several ways. Before showing examples, we point out why target figures can be difficult to set. To work systematically with availability, we have to clarify what the different terms mean: for example, what does a percentage of availability mean?
Let's say a "computer with computer program" is a service. If the computer program does not work one day, then the service is unavailable even if all the other programs work fine. What if the computer program is unavailable in one classroom, but available in the rest of the school (because of an underlying service)? This is difficult to clarify and work with in practice.
Availability can be measured using several methods. Here are some examples:
The value can be availability between 08:00 and 18:00. If the system is down 1 hour during one day, then the system is available for 90% of the agreed time. If availability is measured over a month with 20 work days, the same hour of downtime is 1 hour out of 200, so the system is available 99.5% of the time.
Conversely, if the system is down one hour during an agreed uptime of, for example, 10 hours a day, the system is unavailable for 10% of the time. Measured over 20 days, the system has been unavailable for 0.5% of the time.
One can agree on how long one accepts the system being unavailable during, for example, one month (20 days). It can be a maximum of one hour of unavailability in that period, between 08:00 and 18:00.
Error frequency can also be measured, per day or per month. An example is 3 errors in a month where the system was down between 08:00 and 18:00.
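The percentage calculation can be sketched as follows, counting only the agreed service window. With a 10-hour service day, one hour of downtime in a single day leaves 9 of 10 hours, and one hour of downtime in 20 work days leaves 199 of 200 hours. The function name and parameters are assumptions made for illustration.

```python
def availability_percent(agreed_hours_per_day: float, days: int, downtime_hours: float) -> float:
    """Return availability within the agreed service window, as a percentage."""
    agreed_hours = agreed_hours_per_day * days
    if agreed_hours <= 0:
        raise ValueError("the agreed service window must be positive")
    if not 0 <= downtime_hours <= agreed_hours:
        raise ValueError("downtime must lie within the agreed service window")
    return 100.0 * (agreed_hours - downtime_hours) / agreed_hours


print(availability_percent(10, 1, 1))   # one day, 08:00-18:00, one hour down: 90.0
print(availability_percent(10, 20, 1))  # 20 work days, one hour down in total: 99.5
```

The same hour of downtime gives very different percentages depending on the measurement period, so the period must be stated in the agreement.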
Measured values are a common starting point for judging how to respond to an error beyond ordinary error correction. The customer, for example the school, may ask to pay less for the operating agreement for the current month.
The most important thing is that your measurements describe the user experience in the best possible way. Therefore, one should measure what is important for the user.
The feedback from the schools is that printers cause the most problems. This includes everything from a stopped print queue to missing paper or toner. Some have also experienced instability in the browser, and the OpenOffice.org suite hanging. This may happen when the broadband connection is unstable and documents contain links to the Internet.
To have a stable computing environment, one is dependent on a good enough technical quality of the network. Several schools have experienced instability because the physical computer network is provisional and of poor quality.
Today many invest in wireless networks. In doing so, one must be aware that wireless networks have significant weaknesses. They have limited capacity: it can get quite choppy when about 30 students watch a film from the Internet simultaneously. Wireless networks also have shadows, meaning areas without coverage, so some users end up in blind zones with a poor connection or none at all.
Availability requirements for the maintenance company and ICT service providers should specify good quality of network services to schools.
«Single points of failure»
Some parts of the system simply must work. Failures in a firewall, for example, may compromise security or (if you're lucky) shut down the whole network. This last can also result from problems with the DHCP (Dynamic Host Configuration Protocol) system for sharing out addresses.
The operating department has a responsibility to know which parts may stop the entire system. It is important to find these points and remove them one by one, to the extent you can afford. If one can't afford to remove these sources of errors, one must live with the risk of the entire computer network suddenly grinding to a halt.
Sources of errors that make everything stop may also be logical rather than physical. This is especially true for computer networks and databases. So it is important to have a broader perspective when it comes to such errors.
One must consider what risks one accepts in the network. Is it acceptable that users lose personal files and data, when a hard drive fails? How quickly should one replace broken equipment? Some schools have spent several days getting a server up and running again after a virus attack. The municipality may have no resources to allocate to fix errors.
Much of the operations work goes into maintaining the agreed service level. It's about avoiding the loss of confidence and user satisfaction. Risk management is about having the appropriate resources in place to keep the entire computer system running, and having resources ready if something goes wrong and needs to be fixed.
There is a big difference between installing equipment and software on a single PC and doing it on hundreds, even thousands, of computers. With responsibility for hundreds of machines, a small error that one can live with on one PC means much instability and discontent when it affects hundreds of users.
To avoid mistakes during installation and to contribute to stability, it is essential to test the equipment and software to be used. It's about following up the expected quality. If one wants stable operations, one must often choose the next-to-last edition of equipment and software.
One should avoid adopting software with a version number ending with a zero. For example, one should avoid OpenOffice.org 4.0 and adopt the office program when version 4.0.2 or later has arrived. By then several errors in the program have been fixed. The same applies to hardware.
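The rule of thumb about version numbers can be expressed as a simple check, sketched here under the assumption that versions are plain dotted numbers such as 4.0 or 4.0.2. The function name is hypothetical, and the check is no substitute for actual testing.

```python
def is_initial_release(version: str) -> bool:
    """Return True if the last numeric component of `version` is zero, e.g. "4.0"."""
    parts = version.strip().split(".")
    if not all(part.isdecimal() for part in parts):
        raise ValueError(f"not a plain dotted version number: {version!r}")
    return int(parts[-1]) == 0


print(is_initial_release("4.0"))    # True: wait for a later fix release
print(is_initial_release("4.0.2"))  # False: several errors have been fixed
```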
Server machines usually have a slightly older version of processors, and more robust memory and hard drives. This is because many people use this hardware simultaneously. A small error that would not mean anything for one user can cause downtime if 30 users are logged into the machine.
So testing is about using proven equipment and software editions that have run well for half a year or a year. Testing is also about trying out the different parts in a smaller but realistic context, to ensure that everything works. Adopting the latest version, or even beta versions, of software, or the very latest hardware, usually leads to much trouble and extra maintenance work. Setting systems in production without a small test in a realistic environment usually leads to significant firefighting and dissatisfied users.
When testing on a smaller scale on equipment in production, it is essential to coordinate with those affected. In addition, one must choose when to test. One should not, for example, test new things during exams that use ICT tools.
It is often worth an operations department's while to enhance systems that produce many operational messages. If users get much spam, then it might be wise to install spam-filters. There might be a lot of extra work with students who constantly forget their passwords, if teachers have to get central sysadmin staff to help them out. To avoid extra emailing and double work, the teacher can be authorised to give the student a new password.
These are two examples of design improvements that simplify maintenance and make users happier. A well-run maintenance team has a prioritised list of such improvements. As a rule, the prioritisation is based on how often the relevant issues show up in the service office's message log and on estimates of how much work each improvement will involve.
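One way to combine the two inputs, sketched here purely as an assumption, is to rank each improvement by the number of related messages in the log per estimated hour of work. The category names, effort figures and scoring formula below are invented for the example; a team may well weigh these factors differently.

```python
from collections import Counter


def prioritise(log_categories, effort_hours):
    """Rank improvements by messages per estimated hour of work, highest first."""
    counts = Counter(log_categories)
    scored = [
        (counts[name] / hours, name)
        for name, hours in effort_hours.items()
        if hours > 0
    ]
    # Sort by descending score; ties are broken alphabetically by name.
    return [name for score, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical message log categories and effort estimates (in hours).
log = ["spam", "password", "spam", "password", "password", "printer"]
effort = {"spam": 4.0, "password": 2.0, "printer": 8.0}
print(prioritise(log, effort))  # ['password', 'spam', 'printer']
```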
Planning for availability
This means having realistic expectations of the ICT service based on what operations cost. Plan for the expected availability. For example, when schools require being up and running less than 1 hour after a server crashes, one must have a pre-installed machine standing in reserve, ready to replace the faulty machine. What must be done within that hour is to copy the backup files to the reserve machine.
When a diskless or thin client fails, the school should have a small store of machines and monitors prepared. The school's ICT contact can then fetch and install a replacement machine easily, without waiting days for an equipment order to be filled.
Planning for recovery
Just as users expect equipment to stand ready to replace any that develops defects, they also expect to be able to retrieve lost files and data. Therefore it is crucial to back up user data regularly and keep a copy of the configuration files. One must also have architectural diagrams and descriptions of systems, to enable ICT staff to set systems up quickly when something goes wrong.
It is crucial to schedule backup of user data and settings. One must plan ahead in order to have proper equipment and appropriate services. Routines must be planned to be followed when certain error situations occur and systems must be restored.
Operating continuity, or continuity management, is often the most costly part of the work. High demands on operational continuity require huge investments, which must be agreed upon when making the SLA. For example, it can be agreed that there is no disaster plan for certain services. A disaster plan has very little value if it is not tested once in a while. This is usually expensive. There are examples where customers and management have locked the server room and turned off the power to test the readiness of the IT department.
Operating continuity may be needed more in certain periods, such as examination periods. Extra requirements may then apply, such as having equipment with a backup ready in case of a hard disk failure on the server. But even this requires considerable additional work for the operational staff.
An IT coordinator told us that it might be just as well to postpone the exam by one day if something went wrong with the computer system. This costs a lot less than having double the number of servers at each school. There are examples of schools having had water leaks; it is then usual to defer the examination a day or two to repair the damage. One might think the same way about school data solutions. If you have a backup of the home directories of pupils and teachers, you have time to think without doubling the systems at each school. It is then sufficient to have one or two servers in reserve at the municipality building, which can quickly be moved and connected at the school if something goes wrong.